[00:05:35] <icinga-wm>	 PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:07:35] <icinga-wm>	 RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[00:09:51] <wikibugs>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: Vagrant 1.8.7 fails to fetch Jessie image with vague error message - https://phabricator.wikimedia.org/T158608#3041762 (brion)
[00:19:35] <icinga-wm>	 PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:22:18] <wikibugs>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: Vagrant 1.8.7 fails to fetch Jessie image with vague error message - https://phabricator.wikimedia.org/T158608#3041776 (bd808) I don't see anything at https://atlas.hashicorp.com/debian/boxes/contrib-jessie64 that explicitly says a re...
[00:26:35] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active
[00:29:35] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[00:29:45] <icinga-wm>	 RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 725 bytes in 0.083 second response time
[00:33:25] <icinga-wm>	 PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:33:45] <icinga-wm>	 RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[00:34:35] <icinga-wm>	 RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[00:45:35] <wikibugs>	 (Abandoned) Krinkle: Remove unused top6-wikipedia.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334463 (owner: Krinkle)
[00:47:35] <icinga-wm>	 RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[00:51:47] <wikibugs>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3041805 (brion)
[00:51:51] <wikibugs>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: Vagrant 1.8.7 fails to fetch Jessie image with vague error message - https://phabricator.wikimedia.org/T158608#3041803 (brion) Open>Invalid Running with --debug seems to indicate that Vagrant's downloader is failing to load cu...
[00:54:35] <icinga-wm>	 PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:55:49] <wikibugs>	 (PS3) Tim Starling: Route PHP warnings from the handler into udp2log [mediawiki-config] - https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086)
[00:56:45] <wikibugs>	 (CR) Tim Starling: "Bryan: Done" [mediawiki-config] - https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: Tim Starling)
[01:01:25] <icinga-wm>	 RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[01:04:45] <icinga-wm>	 PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[01:07:21] <wikibugs>	 (CR) Tim Starling: [C: 2] "<bd808> TimStarling: yeah. give it a shot and lets see how fast the hard drive fills up :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: Tim Starling)
[01:08:59] <wikibugs>	 (Merged) jenkins-bot: Route PHP warnings from the handler into udp2log [mediawiki-config] - https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: Tim Starling)
[01:09:07] <wikibugs>	 (CR) jenkins-bot: Route PHP warnings from the handler into udp2log [mediawiki-config] - https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: Tim Starling)
[01:14:35] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active
[01:17:31] <logmsgbot>	 !log tstarling@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 42s)
[01:17:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:35] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[01:18:59] <Yvette>	 No justification provided, snap.
[01:23:35] <icinga-wm>	 RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[01:33:45] <icinga-wm>	 RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[01:55:35] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active
[01:58:35] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[01:58:40] <wikibugs>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: Job runner service doesn't appear to work in jessie-migration - https://phabricator.wikimedia.org/T158615#3041900 (brion)
[02:03:30] <wikibugs>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: Job runner service doesn't appear to work in jessie-migration - https://phabricator.wikimedia.org/T158615#3041913 (brion) Note there is no `logs/mediawiki-runJobs.log` file, and I cannot connect to port 80 on 127.0.0.1 from within the...
[02:09:35] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active
[02:12:06] <wikibugs>	 (PS1) Andrew Bogott: WIP:  Sync ldap project groups with keystone project membership [puppet] - https://gerrit.wikimedia.org/r/338918
[02:12:35] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[02:19:17] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.12) (duration: 07m 20s)
[02:19:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:25] <icinga-wm>	 PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:24:37] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Feb 21 02:24:37 UTC 2017 (duration 5m 20s)
[02:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:10] <wikibugs>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: Job runner service doesn't appear to work in jessie-migration - https://phabricator.wikimedia.org/T158615#3041919 (bd808) Likely broken by {rMWVA1956f986abfe} where we dropped the port 80 bind.
[02:33:55] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.500 second response time
[02:38:55] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.382 second response time
[02:48:25] <icinga-wm>	 RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[03:10:55] <icinga-wm>	 PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:24:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 646.40 seconds
[03:27:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 230.32 seconds
[03:30:35] <icinga-wm>	 PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:37:55] <icinga-wm>	 RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[03:50:25] <icinga-wm>	 PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:52:45] <wikibugs>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3041958 (brion)
[03:58:35] <icinga-wm>	 RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[04:09:25] <icinga-wm>	 PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:18:25] <icinga-wm>	 RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[04:33:50] <wikibugs>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team: npm install fails for changeprop service in MW-Vagrant jessie-migration - https://phabricator.wikimedia.org/T158617#3041968 (brion)
[04:38:25] <icinga-wm>	 RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[04:38:55] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.232 second response time
[04:48:56] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.810 second response time
[05:06:35] <icinga-wm>	 PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:15:45] <wikibugs>	 (PS1) Krinkle: webperf: Remove unused deprecate.py [puppet] - https://gerrit.wikimedia.org/r/338929
[05:16:35] <wikibugs>	 (CR) Krinkle: "The last value in this schema was received in March 2016 per https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=Dep" [puppet] - https://gerrit.wikimedia.org/r/338929 (owner: Krinkle)
[05:17:15] <wikibugs>	 (PS2) Krinkle: webperf: Remove unused deprecate.py [puppet] - https://gerrit.wikimedia.org/r/338929
[05:17:57] <wikibugs>	 (CR) Krinkle: "@ops: What's the convention for ensuring this service is stopped? Or can we just do it manually?" [puppet] - https://gerrit.wikimedia.org/r/338929 (owner: Krinkle)
[05:34:35] <icinga-wm>	 RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[05:45:45] <icinga-wm>	 PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:46:35] <icinga-wm>	 PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[05:46:55] <icinga-wm>	 RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[06:02:08] <subbu>	 is gerrit really slow or just for me? and git pull --rebase on the puppet repo has been failing for me for the last 2-3 hours.
[06:13:45] <icinga-wm>	 RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:38:56] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.308 second response time
[06:44:05] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.077 second response time
[06:50:25] <icinga-wm>	 PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:12:19] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Repool db2048, depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/338934 (https://phabricator.wikimedia.org/T132416)
[07:13:55] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.486 second response time
[07:18:55] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.597 second response time
[07:19:25] <icinga-wm>	 RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:26:07] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Repool db2048, depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/338934 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:27:36] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Repool db2048, depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/338934 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:27:57] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Repool db2048, depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/338934 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:29:05] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2048 and depool db2055 - T132416 (duration: 00m 51s)
[07:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:11] <stashbot>	 T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[07:31:37] <marostegui>	 !log Deploy alter table enwiki.revision db2055 - T132416
[07:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:55] <wikibugs>	 (CR) Muehlenhoff: "You could use" [puppet] - https://gerrit.wikimedia.org/r/338929 (owner: Krinkle)
[08:06:35] <icinga-wm>	 PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:16:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "LGTM but see unrelated change which slipped in?" (2 comments) [dns] - https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) (owner: Papaul)
[08:22:21] <wikibugs>	 (CR) Gehel: [C: 2] elasticsearch - reimage elastic10(27|32|37|41) to jessie and move data to /srv [puppet] - https://gerrit.wikimedia.org/r/338811 (https://phabricator.wikimedia.org/T151326) (owner: Gehel)
[08:24:06] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(27|32|37|41).eqiad.wmnet
[08:24:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:25] <icinga-wm>	 PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:27:19] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042192 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1027.eqiad.wmnet'] ``` The...
[08:27:41] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042193 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1032.eqiad.wmnet'] ``` The...
[08:27:44] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042194 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1041.eqiad.wmnet'] ``` The...
[08:27:59] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042195 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1037.eqiad.wmnet'] ``` The...
[08:28:55] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.480 second response time
[08:30:24] <gehel>	 !log increasing concurrent recoveries / relocations to 8 on elasticsearch eqiad
[08:30:25] <icinga-wm>	 PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:30:27] <wikibugs>	 (CR) Gilles: [C: 1] webperf: Remove unused deprecate.py [puppet] - https://gerrit.wikimedia.org/r/338929 (owner: Krinkle)
[08:30:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:19] <moritzm>	 !log upgrading mw1170-mw1208 to HHVM 3.12.14
[08:32:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:35] <icinga-wm>	 RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:34:43] <wikibugs>	 Operations: Manage apt sources via puppet? - https://phabricator.wikimedia.org/T158562#3040563 (fgiunchedi) +1, sounds good to me!
[08:38:55] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.455 second response time
[08:41:26] <icinga-wm>	 RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:42:48] <wikibugs>	 Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#3042204 (jcrespo) a:akosiaris>None
[08:43:04] <wikibugs>	 Operations, hardware-requests: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3042206 (jcrespo)
[08:43:06] <wikibugs>	 Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2917240 (jcrespo)
[08:44:27] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042210 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1027.eqiad.wmnet'] ```  Of which those **FAILED**: ``` set(['elastic1027.eqi...
[08:44:29] <wikibugs>	 Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2917240 (jcrespo) I arrived here through cronspam, added T156506 as a subtask (meaning it depends on, it is obviously not a subtask) so others do not lose time next time.
[08:48:58] <wikibugs>	 (PS1) Filippo Giunchedi: Revert "graphite: switch to graphite2001" [dns] - https://gerrit.wikimedia.org/r/338938 (https://phabricator.wikimedia.org/T157022)
[08:52:14] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042218 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1041.eqiad.wmnet'] ```  and were **ALL** successful.
[08:52:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] Revert "graphite: switch to graphite2001" [dns] - https://gerrit.wikimedia.org/r/338938 (https://phabricator.wikimedia.org/T157022) (owner: Filippo Giunchedi)
[08:52:25] <icinga-wm>	 RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[08:52:34] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042219 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1037.eqiad.wmnet'] ```  and were **ALL** successful.
[08:53:27] <godog>	 !log switch statsd/graphite DNS to graphite1001 - T157022
[08:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:33] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[09:01:35] <icinga-wm>	 PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg]
[09:07:05] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042240 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1032.eqiad.wmnet'] ```  Of which those **FAILED**: ``` set(['elastic1032.eqi...
[09:13:19] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042241 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1027.eqiad.wmnet'] ``` The...
[09:13:21] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042242 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1032.eqiad.wmnet'] ``` The...
[09:16:37] <elukey>	 godog: need to restart all the jmxtrans on analytics to pick up graphite1001?
[09:18:30] <godog>	 elukey: heh I think so, let's wait another 45m or so for the ttl to be fully expired and see what still goes to graphite2001
[09:18:50] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042244 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1032.eqiad.wmnet'] ```  Of which those **FAILED**: ``` set(['elastic1032.eqi...
[09:19:39] <elukey>	 godog: sure! Let me know if I have to run a round of restarts to help
[09:20:52] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS on elastic1032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:21:16] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042245 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1032.eqiad.wmnet'] ``` The...
[09:21:18] <gehel>	 ^elastic1032 is me - reimage in progress...
[09:26:00] <ema>	 !log cp3030: libssl1.1 upgraded to 1.1.0e-1+wmf1, libevent-2.0-5 upgraded to 2.0.21-stable-2+deb8u1
[09:26:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:46] <icinga-wm>	 RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[09:35:36] <icinga-wm>	 PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:35:49] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042278 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1027.eqiad.wmnet'] ```  and were **ALL** successful.
[09:36:16] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1161 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[09:36:44] <elukey>	 checking --^
[09:36:50] <moritzm>	 it's me, upgrading
[09:36:59] <moritzm>	 it's depooled
[09:37:05] <elukey>	 \o/
[09:37:09] <elukey>	 thanks :)
[09:37:16] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1161 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time
[09:37:43] <elukey>	 moritzm: does conf-tool have any effect on jobrunners?
[09:38:16] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[09:38:18] <elukey>	 (except DSH)
[09:39:16] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time
[09:39:36] <icinga-wm>	 PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg]
[09:39:37] <wikibugs>	 (PS2) Hashar: nagios_common: basic spec for contacts.cfg [puppet] - https://gerrit.wikimedia.org/r/331490
[09:41:20] <wikibugs>	 (CR) Elukey: [C: 1] "Weird that ${$mirror_name} didn't trigger any issue, it must evaluate as ${mirror_name} (I checked on kafka1001 and the substitution was r" [puppet] - https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[09:42:56] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[09:43:56] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time
[09:44:26] <wikibugs>	 Operations, MediaWiki-extensions-InterwikiSorting, Wikidata, Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3042303 (Addshore)
[09:46:30] <godog>	 !log upgrade graphite on graphite1001 and bounce carbon daemons
[09:46:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:54] <hashar>	 I have some funny oddity with graphite :-}
[09:50:41] <hashar>	 the .upper metric looks off using the week bucket bah
[09:51:09] <hashar>	 godog: you have been upgrading Graphite haven't you ?
[09:51:13] <wikibugs>	 Operations, MediaWiki-extensions-InterwikiSorting, Wikidata, Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#2776612 (aude) @Lydia_Pintscher only thing that changes is that interwiki links are sorted in all names...
[09:51:24] <hashar>	 or Carbon or statsd maybe?
[09:52:06] <godog>	 hashar: the former yeah
[09:53:06] <hashar>	 the .upper is behaving strangely with the last 7 days of data
[09:53:11] <hashar>	 but the long aggregation looks fine
[09:53:12] <hashar>	 https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=13&fullscreen&from=now-14d&to=now :}
[09:53:29] <hashar>	 should be around 2 minutes
[09:53:52] <hashar>	 but for the last 7 days of data (which I assume is the bucket of 7 days that keeps the per minute data points)  it does not seem to keep the max
[09:54:06] <hashar>	 will file a bug / try to reproduce with the graphite web iface
[09:54:54] <godog>	 hashar: yeah please file a bug for that!
[09:55:28] <wikibugs>	 Operations, MediaWiki-extensions-InterwikiSorting, Wikidata, Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3042332 (Addshore) Indeed!
[09:55:44] <hashar>	 median is not affected :}
[09:57:32] <wikibugs>	 Operations, Discovery, WMDE-Analytics-Engineering, Wikidata, and 3 others: wdqs - move metric collections to diamond - https://phabricator.wikimedia.org/T146468#3042338 (Addshore) Open>Resolved a:Addshore
[09:58:01] <moritzm>	 !log upgrading mira/tin to HHVM 3.12.14
[09:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:36] <icinga-wm>	 PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:03:36] <icinga-wm>	 RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[10:08:11] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042390 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1032.eqiad.wmnet'] ```  and were **ALL** successful.
[10:08:36] <icinga-wm>	 RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[10:10:57] <wikibugs>	 (PS1) Jcrespo: Change hosts list to use tab-separated format, update them [software] - https://gerrit.wikimedia.org/r/338941
[10:11:29] <hashar>	 godog: filed https://phabricator.wikimedia.org/T158633  I can live without .upper .lower for a while.  The .median is apparently not affected :}
[10:12:15] <godog>	 hashar: ok thanks!
[10:12:37] <hashar>	 I haven't tried on other time based metrics
[10:13:16] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic1032 is OK: SSL OK - Certificate elastic1032.eqiad.wmnet valid until 2022-02-20 10:11:41 +0000 (expires in 1824 days)
[10:13:31] <wikibugs>	 (CR) Marostegui: [C: 1] Change hosts list to use tab-separated format, update them [software] - https://gerrit.wikimedia.org/r/338941 (owner: Jcrespo)
[10:14:16] <icinga-wm>	 PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 25 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm],Package[hhvm-dbg]
[10:14:52] <godog>	 !log downgrade carbon-c-relay on graphite1001 to trusty's version and bounce daemons
[10:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:13] <wikibugs>	 (PS1) Jcrespo: Update s5-pager to the latest version (including extra indexes) [software] - https://gerrit.wikimedia.org/r/338943 (https://phabricator.wikimedia.org/T147747)
[10:19:16] <icinga-wm>	 RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[10:22:13] <wikibugs>	 (PS2) Volans: Improvements in the metadata and package setup [software/cumin] - https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588)
[10:24:50] <wikibugs>	 (CR) Jcrespo: [C: 2] Change hosts list to use tab-separated format, update them [software] - https://gerrit.wikimedia.org/r/338941 (owner: Jcrespo)
[10:27:36] <icinga-wm>	 RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[10:29:53] <moritzm>	 !log restarting base services on mw2* after openssl update
[10:29:56] <godog>	 elukey: yeah if you could start a rolling restart of jmxtrans that'd be nice! make sure to mention T157022 in !log so we can keep track of it
[10:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:59] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[10:30:03] <godog>	 I have to go in ~15 min
[10:31:06] <elukey>	 yessir!
[10:55:50] <elukey>	 !log rolling restart of the analytics jmxtrans daemons for T157022
[10:55:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:55] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[10:57:28] <wikibugs>	 (03PS1) 10Subramanya Sastry: Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950
[10:59:25] <wikibugs>	 (03PS3) 10Volans: Improvements in the metadata and package setup [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588)
[11:02:14] <elukey>	 !log rolling restart of cassandra-metrics-collector on aqs1* for T157022
[11:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:20] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[11:03:01] <elukey>	 godog: I've done the whole Hadoop cluster, Druid and AQS nodes.. IIRC it should be enough, but let me know if I am missing something
[11:05:12] <moritzm>	 !log upgrading openssl on hadoop cluster / various base service restarts
[11:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:16] <icinga-wm>	 PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:15:23] <moritzm>	 !log upgrading openssl on restbase clusters / various base service restarts
[11:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:36] <icinga-wm>	 PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:18:56] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.235 second response time
[11:19:44] <tabbycat>	 second coffee in two hours and still feeling asleep :S
[11:26:29] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546)
[11:27:20] <hashar>	 tabbycat: then you should have a half-hour sleep session :}
[11:27:50] <hashar>	 tabbycat:  or try to get outside for a while for some fresh air. That might wake you up ? :)
[11:29:00] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.876 second response time
[11:29:51] <wikibugs>	 (03PS22) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588)
[11:29:59] <wikibugs>	 06Operations, 10Domains, 10Traffic: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (10Kaarel_Vaidla)
[11:32:52] <moritzm>	 !log upgrading openssl on kafka clusters / various base service restarts
[11:32:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:44] <wikibugs>	 (03PS2) 10Elukey: Move codfw appserver conftool-data to codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/338108 (https://phabricator.wikimedia.org/T156023)
[11:34:20] <wikibugs>	 (03PS23) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588)
[11:36:07] <wikibugs>	 (03CR) 10Elukey: [C: 032] Move codfw appserver conftool-data to codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/338108 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey)
[11:37:25] <wikibugs>	 (03PS1) 10Ema: cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558)
[11:39:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) (owner: 10Ema)
[11:39:57] <wikibugs>	 06Operations: Manage apt sources via puppet? - https://phabricator.wikimedia.org/T158562#3040563 (10faidon) `apt::repository` has a `comment_old` option that comments out the line in sources.list, and `Apt::Repository[wikimedia]` sets that to `true`, so this should be the case already. If it's not, it's probably...
[11:40:20] <icinga-wm>	 RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[11:40:36] <wikibugs>	 (03CR) 10Volans: "Last puppet compiler, including the require of cumin's package:" [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[11:42:21] <wikibugs>	 (03PS2) 10Ema: cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558)
[11:42:40] <icinga-wm>	 PROBLEM - DPKG on poolcounter1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:43:01] <moritzm>	 ^ poolcounter1002 is harmless, update in progress
[11:43:40] <icinga-wm>	 RECOVERY - DPKG on poolcounter1002 is OK: All packages OK
[11:45:40] <icinga-wm>	 RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[11:49:40] <icinga-wm>	 PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:50:22] <wikibugs>	 (03PS24) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588)
[11:54:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[11:57:19] <wikibugs>	 (03CR) 10Ema: "PCC output here https://puppet-compiler.wmflabs.org/5518/" [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) (owner: 10Ema)
[12:01:06] <volans>	 !log temporarily disabled puppet on neodymium and puppetmaster1001 to merge Gerrit 330436 T154588
[12:01:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:12] <stashbot>	 T154588: Automation framework first version - https://phabricator.wikimedia.org/T154588
[12:02:13] <wikibugs>	 (03PS25) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588)
[12:03:58] <wikibugs>	 (03CR) 10Volans: [C: 032] Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[12:05:00] <icinga-wm>	 PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:07:40] <icinga-wm>	 PROBLEM - Check systemd state on mw2245 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:07:50] <icinga-wm>	 PROBLEM - Check systemd state on mw1255 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:07:50] <icinga-wm>	 PROBLEM - Check systemd state on mw2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:07:50] <icinga-wm>	 PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:07:51] <icinga-wm>	 PROBLEM - Check systemd state on db1085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:07:51] <icinga-wm>	 PROBLEM - Check systemd state on sarin is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:07:52] <volans>	 that's ferm, checking
[12:07:58] <volans>	 argh...
[12:08:00] <icinga-wm>	 PROBLEM - Check systemd state on elastic1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:08:10] <icinga-wm>	 PROBLEM - Check systemd state on mc2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:08:10] <icinga-wm>	 PROBLEM - Check systemd state on elastic1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:08:11] <wikibugs>	 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3042597 (10kruusamagi) I removed the date info from the main page of Estonian Wikipedia, but it only helps to hide the issue and not to solve it (the weekl...
[12:08:32] <volans>	 !log stopped ircecho temporarily while fixing ferm
[12:08:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:19] <wikibugs>	 (03PS1) 10Volans: Cumin: fix ferm service srange [puppet] - 10https://gerrit.wikimedia.org/r/338956 (https://phabricator.wikimedia.org/T154588)
[12:12:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] Cumin: fix ferm service srange [puppet] - 10https://gerrit.wikimedia.org/r/338956 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[12:12:36] <wikibugs>	 (03CR) 10Volans: [C: 032] Cumin: fix ferm service srange [puppet] - 10https://gerrit.wikimedia.org/r/338956 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[12:14:21] <wikibugs>	 (03PS1) 10Gilles: Increase SWIFT_RETRIES in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/338957 (https://phabricator.wikimedia.org/T157949)
[12:14:39] <volans>	 fix works properly
[12:28:16] <ema>	 volans: still lots of CRITs in icinga, is that because of ferm?
[12:28:58] <volans>	 ema: yes, I'm running puppet, they're recovering
[12:29:08] <ema>	 ok
[12:29:08] <volans>	 sorry about the mess
[12:29:12] <ema>	 no worries
[12:31:24] <icinga-wm>	 RECOVERY - Check systemd state on mw1225 is OK: OK - running: The system is fully operational
[12:31:24] <icinga-wm>	 RECOVERY - Check systemd state on darmstadtium is OK: OK - running: The system is fully operational
[12:31:24] <icinga-wm>	 RECOVERY - Check systemd state on mw1162 is OK: OK - running: The system is fully operational
[12:31:24] <icinga-wm>	 RECOVERY - Check systemd state on mw1263 is OK: OK - running: The system is fully operational
[12:31:24] <icinga-wm>	 RECOVERY - Check systemd state on mw1271 is OK: OK - running: The system is fully operational
[12:31:25] <icinga-wm>	 RECOVERY - Check systemd state on argon is OK: OK - running: The system is fully operational
[12:31:25] <icinga-wm>	 RECOVERY - Check systemd state on mw2105 is OK: OK - running: The system is fully operational
[12:31:37] <volans>	 shut up ircecho :)
[12:38:33] <wikibugs>	 (03PS1) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023)
[12:39:38] <volans>	 !log reenabled ircecho after fixing ferm issue and run puppet on affected hosts
[12:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:56] <volans>	 ema: all done, back to normal
[12:40:09] <ema>	 volans: nice, thanks!
[12:40:31] <wikibugs>	 (03PS2) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023)
[12:42:08] <icinga-wm>	 RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys.
[12:51:50] <volans>	 !log re-enabled puppet on planet2001, it had been disabled for a week without reason
[12:51:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:51] <volans>	 !log re-enabled puppet on neodymium and puppetmaster1001 after Gerrit 330436 was merged T154588
[12:53:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:56] <stashbot>	 T154588: Automation framework first version - https://phabricator.wikimedia.org/T154588
[12:55:03] <moritzm>	 !log upgrading openssl on database servers / various base service restarts
[12:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:09] <wikibugs>	 (03PS1) 10Phuedx: Enable "reading depth" logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639)
[13:04:59] <icinga-wm>	 PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:06:53] <moritzm>	 !log upgrading openssl on parsoid clusters / various base service restarts
[13:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:59] <wikibugs>	 (03CR) 10Paladox: [C: 031] "@Jcrespo hi, we can still merge this and at a later date put this into a separate repo. But as you pointed out your buisy so why reject th" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox)
[13:15:15] <wikibugs>	 (03PS2) 10Phuedx: Enable ReadingDepth instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639)
[13:20:39] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active
[13:21:44] <moritzm>	 !log upgrading openssl on aqs cluster / various base service restarts
[13:21:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:05] <wikibugs>	 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3042690 (10Paladox) We can still do https://gerrit.wikimedia.org/r/#/c/336002/ since I doint see an urgency to ha...
[13:23:39] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[13:25:29] <wikibugs>	 (03PS3) 10Phuedx: Enable ReadingDepth logging on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639)
[13:26:23] <wikibugs>	 (03CR) 10Bmansurov: [C: 031] Enable ReadingDepth logging on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) (owner: 10Phuedx)
[13:31:59] <icinga-wm>	 RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[13:37:09] <icinga-wm>	 PROBLEM - Keyholder SSH agent on sarin is CRITICAL: CRITICAL: Cannot connect to keyholder-proxy socket /run/keyholder/proxy.sock.
[13:41:05] <elukey>	 !log restarting nodejs on aqs1* to pick up openssl security upgrades
[13:41:09] <icinga-wm>	 RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys.
[13:41:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:59] <wikibugs>	 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 3 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#2947598 (10faidon) >>! In T155578#3037355, @EBernhardson wrote: > I poked paravoid about if we had any better solutions this time aroun...
[13:42:38] <wikibugs>	 (03CR) 10Hashar: Support Jenkins install from 'experimental' component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) (owner: 10Hashar)
[13:45:09] <icinga-wm>	 PROBLEM - Keyholder SSH agent on sarin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[13:45:16] <wikibugs>	 (03PS3) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023)
[13:46:09] <icinga-wm>	 RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys.
[13:46:53] <wikibugs>	 (03PS6) 10Hashar: jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/337385
[13:46:55] <wikibugs>	 (03PS9) 10Hashar: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388
[13:46:57] <wikibugs>	 (03PS3) 10Hashar: jenkins: add basic specs [puppet] - 10https://gerrit.wikimedia.org/r/337836
[13:48:24] <wikibugs>	 (03PS4) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023)
[13:49:39] <volans>	 keyholder on sarin: that's me
[13:57:33] * kart_ waves for SWAT
[14:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1400).
[14:00:04] <jouncebot>	 kart_, addshore, jan_drewniak, and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[14:00:14] <hashar>	 o/
[14:00:14] <zeljkof>	 o/
[14:00:19] <kart_>	 \0
[14:00:21] <phuedx>	 o/
[14:00:30] <hashar>	 kart_: doing your
[14:00:31] <jan_drewniak>	 o/
[14:00:36] <kart_>	 cool.
[14:00:50] <zeljkof>	 hashar: looks like you are in charge, ping me if you need help :)
[14:00:50] <moritzm>	 !log upgrading openssl on maps clusters / various base service restarts
[14:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:15] <wikibugs>	 (03PS1) 10Jcrespo: prometheus-mysql-exporter: Add labsdb1005, just upgraded from precise [puppet] - 10https://gerrit.wikimedia.org/r/338970
[14:01:34] <hashar>	 kart_: your patch is now on mwdebug1001
[14:01:57] <hashar>	 looking at phuedx one
[14:02:10] <kart_>	 hashar: sure. Testing. Give me few minutes.
[14:02:27] <hashar>	 our configuration files are insane
[14:03:12] <phuedx>	 hashar: ^
[14:03:19] <phuedx>	 👍
[14:03:23] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "Filippo: let's talk, we probably need to do some changes to prometheus configuration for mysql-exporter connection, both the socket and in" [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo)
[14:03:57] <wikibugs>	 (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) (owner: 10Phuedx)
[14:04:49] <wikibugs>	 (03PS2) 10Hashar: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[14:04:58] <hashar>	 jan_drewniak: and I rebased your portal patch :}
[14:05:19] <wikibugs>	 (03Merged) 10jenkins-bot: Enable ReadingDepth logging on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) (owner: 10Phuedx)
[14:06:14] <hashar>	 phuedx: ReadingDepth logging is enabled on mwdebug1001
[14:06:25] <hashar>	 phuedx: then I guess there is no good way to test it is there?
[14:06:56] <phuedx>	 hashar: there's no clean way to test it -- i can see what values are getting forwarded to the client
[14:07:02] <hashar>	 jan_drewniak: do you want to deploy the change yourself or should I ?
[14:07:14] <hashar>	 phuedx: guess I can just push it cluster wide
[14:07:21] <wikibugs>	 (03CR) 10jenkins-bot: Enable ReadingDepth logging on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) (owner: 10Phuedx)
[14:07:37] <jan_drewniak>	 hashar: could you? thanks
[14:07:43] <kart_>	 hashar: go ahead.
[14:07:44] <wikibugs>	 (03CR) 10Hashar: [C: 032] "It is SWAT time :}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[14:07:49] <hashar>	 jan_drewniak: I will
[14:08:00] <hashar>	 kart_: thanks for the test :}  Pushing it to prod
[14:08:21] <logmsgbot>	 !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable ReadingDepth logging on Wikipedias - T148262 T155639 (duration: 00m 45s)
[14:08:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:26] <stashbot>	 T148262: Vet and explore new readership engagement metric - https://phabricator.wikimedia.org/T148262
[14:08:26] <stashbot>	 T155639: Create reading depth schema - https://phabricator.wikimedia.org/T155639
[14:08:51] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[14:08:59] <icinga-wm>	 PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:09:01] <wikibugs>	 06Operations, 10netops: cr2-knams<->asw-esams GBLX fiber down - https://phabricator.wikimedia.org/T158647#3042814 (10faidon)
[14:09:43] <wikibugs>	 (03CR) 10jenkins-bot: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[14:09:49] <icinga-wm>	 PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:09:58] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "I do not see any urgency on this and unblocking it- the bug it will fix will be blocked anyway by many alter table , that I also have to d" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox)
[14:10:14] <logmsgbot>	 !log hashar@tin Synchronized php-1.29.0-wmf.12/extensions/UniversalLanguageSelector/: Fix broken site picks feature for compact language links (duration: 01m 04s)
[14:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:39] <phuedx>	 hashar: thanks ❤️
[14:11:02] <hashar>	 ;D
[14:11:14] <hashar>	 jan_drewniak: I have pushed it to mwdebug1001
[14:11:47] <hashar>	 looks like it is all fine 
[14:12:01] <jan_drewniak>	 hashar: yup! 
[14:12:37] <logmsgbot>	 !log hashar@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 40s)
[14:12:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:18] <logmsgbot>	 !log hashar@tin Synchronized portals: (no justification provided) (duration: 00m 41s)
[14:13:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:56] <kart_>	 hashar: If we want to run a script on a wiki, does that script have to be in the same branch as the wiki in production (the script will be in production tomorrow)?
[14:18:08] <hashar>	 kart_: it depends on the script I guess ? :}
[14:18:14] <hashar>	 jan_drewniak: completed :}
[14:18:38] <wikibugs>	 (03PS2) 10Hashar: Enable TwoColConflict extension on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338738 (https://phabricator.wikimedia.org/T158493) (owner: 10Addshore)
[14:18:51] <wikibugs>	 (03CR) 10Hashar: [C: 032] Enable TwoColConflict extension on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338738 (https://phabricator.wikimedia.org/T158493) (owner: 10Addshore)
[14:19:50] <addshore>	 *waves*
[14:20:01] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[14:20:12] <wikibugs>	 (03Merged) 10jenkins-bot: Enable TwoColConflict extension on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338738 (https://phabricator.wikimedia.org/T158493) (owner: 10Addshore)
[14:20:13] <addshore>	 apparently I missed the first ping!
[14:20:20] <wikibugs>	 (03CR) 10jenkins-bot: Enable TwoColConflict extension on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338738 (https://phabricator.wikimedia.org/T158493) (owner: 10Addshore)
[14:20:52] <Zppix>	 jouncebot:  now
[14:20:52] <jouncebot>	 For the next 0 hour(s) and 39 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1400)
[14:20:59] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3871050 keys, up 113 days 5 hours - replication_delay is 0
[14:21:33] <kart_>	 hashar: script is at, https://gerrit.wikimedia.org/r/#/c/336073/ and to be run on svwiki
[14:22:11] <wikibugs>	 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3042853 (10Zppix) >>! In T158490#3039427, @Platonides wrote: > MW is the country-code of Malawi (ISO 3166-1), so I find unlikely we would be...
[14:22:25] <hashar>	 addshore: deploying :}
[14:22:29] <wikibugs>	 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3042854 (10Zppix) >>! In T158490#3039567, @Matthewrbowker wrote: >>>! In T158490#3039326, @Zppix wrote: >> @Aklapper I meant like if abbrev'd...
[14:22:30] <addshore>	 thanks!
[14:22:39] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active
[14:22:52] <logmsgbot>	 !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable TwoColConflict extension on arwiki - T158493 (duration: 00m 40s)
[14:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:58] <stashbot>	 T158493: Deploy TwoColConflict beta to Arabic Wikipedia - https://phabricator.wikimedia.org/T158493
[14:23:27] <hashar>	 kart_: you would have to backport the script from the master branch to the branch that svwiki is running .  That is wmf.12 ( http://tools.wmflabs.org/versions/ )
[14:23:29] <icinga-wm>	 PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 3456 MB (12% inode=96%)
[14:23:42] <hashar>	 kart_: then CR+2 ,  deploy it and you will be able to run the script on terbium
[14:24:01] <hashar>	 assuming the script does not depend on some code that got introduced between wmf.12 and master
[14:24:06] <kart_>	 hashar: nice. I'll do that as a part of SWAT.
[14:24:13] <kart_>	 hashar: no. It doesn't.
[14:24:14] <hashar>	 sure thing!
[14:24:18] <kart_>	 Thanks!
[14:24:28] <hashar>	 let me do the dance :D
[14:24:48] <hashar>	 https://gerrit.wikimedia.org/r/#/c/338971/1  and CR+2
[14:25:04] <moritzm>	 !log upgrading openssl on memcached clusters / various base service restarts
[14:25:06] <kart_>	 Oops :)
[14:25:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:15] <kart_>	 hashar: I also did cherry-pick :/
[14:25:29] <hashar>	 hehe
[14:25:39] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[14:26:04] <chasemp>	 ^ looking
[14:26:19] <kart_>	 hashar: Let's keep it for tomorrow?
[14:26:28] <paravoid>	 chasemp: it has been flapping for hours
[14:26:37] <paravoid>	 (fyi, since you're investigating)
[14:26:42] <chasemp>	 thanks
[14:26:44] <hashar>	 kart_: I will deploy it and you can run it later today or tomorrow :)
[14:26:53] <kart_>	 hashar: Okay. cool.
[14:27:09] <jan_drewniak>	 hashar: also question, I don't think the portals sync-script worked :/
[14:27:20] <hashar>	 eeek
[14:27:42] <hashar>	 jan_drewniak: maybe it failed to purge the URLs?
[14:28:22] * hashar tries again
[14:28:57] <logmsgbot>	 !log hashar@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 40s)
[14:29:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:05] <moritzm>	 !log restarting NTP servers on dns_recursors to pick up openssl update (one by one)
[14:29:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:29] <icinga-wm>	 RECOVERY - Disk space on elastic1030 is OK: DISK OK
[14:29:37] <logmsgbot>	 !log hashar@tin Synchronized portals: (no justification provided) (duration: 00m 40s)
[14:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:43] <dcausse>	 !log truncated main log file on elastic1030
[14:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:02] <hashar>	 jan_drewniak: I redeployed and  https://www.wikipedia.org/ should have been purged
[14:30:14] <hashar>	 jan_drewniak: maybe other urls need a purge as well?
[14:32:28] <wikibugs>	 (03PS1) 10Rush: labstore: 1001 and 1002 are currently idle [puppet] - 10https://gerrit.wikimedia.org/r/338973
[14:32:49] <icinga-wm>	 PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[14:33:00] <jan_drewniak>	 the script syncs `portals/prod/wikipedia.org/assets $*` and `portals $*` maybe it needs  `portals/prod/wikipedia.org  $*` ? 
[14:33:49] <icinga-wm>	 RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 0.000232 secs
[14:35:17] <moritzm>	 !log upgrading openssl on logstash cluster / various base service restarts
[14:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:59] <icinga-wm>	 RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[14:37:00] <kart_>	 hashar: did you cherry pick script to wmf12?
[14:38:49] <icinga-wm>	 RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[14:39:14] <hashar>	 kart_: waited for it to merge
[14:39:17] <hashar>	 still waiting :}
[14:39:59] <icinga-wm>	 PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[14:40:39] <kart_>	 hashar: now :)
[14:40:59] <icinga-wm>	 RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 2e-06 secs
[14:41:37] <hashar>	 kart_: deploying
[14:42:38] <hashar>	 syncing
[14:43:14] <logmsgbot>	 !log hashar@tin Synchronized php-1.29.0-wmf.12/extensions/UniversalLanguageSelector/maintenance/ULSCompactLinksDisablePref.php: Add a maintenance script for opt-in T133031 (duration: 00m 41s)
[14:43:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:19] <stashbot>	 T133031: Preference conversion for Compact Language Links - https://phabricator.wikimedia.org/T133031
[14:43:58] <kart_>	 hashar: do we have space for one more patch?
[14:44:03] <hashar>	 yu
[14:44:05] <hashar>	 yes
[14:44:14] <hashar>	 and the ULS maintenance script should now be on terbium
[14:44:39] <kart_>	 hashar: cool.
[14:44:59] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 152299
[14:45:53] <kart_>	 hashar: waiting for Jenkins for, https://gerrit.wikimedia.org/r/#/c/338974/1
[14:46:31] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix debdeploy group for kubernetes-mastes [puppet] - 10https://gerrit.wikimedia.org/r/338975
[14:47:49] <icinga-wm>	 PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:48:19] <hashar>	 kart_: arharhh
[14:48:33] <hashar>	 that code has a horrible look'n feel :D
[14:49:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Fix debdeploy group for kubernetes-mastes [puppet] - 10https://gerrit.wikimedia.org/r/338975 (owner: 10Muehlenhoff)
[14:50:35] <kart_>	 hashar: which code?
[14:52:38] <wikibugs>	 06Operations, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3042911 (10hashar)
[14:52:51] <wikibugs>	 06Operations, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3042924 (10hashar)
[14:53:14] <wikibugs>	 06Operations, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3042911 (10hashar)
[14:53:17] <Nikerabbit>	 hashar: we also have a task to make it more sane ;)
[14:55:46] <wikibugs>	 (03PS1) 10Gehel: elasticsearch - reimage elastic10(33|34|38|42) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/338977 (https://phabricator.wikimedia.org/T151326)
[14:56:14] <hashar>	 Nikerabbit: kart_ I have CR+2 the wmf.12 patch https://gerrit.wikimedia.org/r/#/c/338976/
[14:56:40] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(33|34|38|42).eqiad.wmnet
[14:56:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:19] <kart_>	 hashar: okay! Let me know once on mwdebug1001
[15:00:12] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic10(33|34|38|42) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/338977 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel)
[15:02:25] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042981 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1033.eqiad.wmnet'] ``` The...
[15:02:33] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042984 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1034.eqiad.wmnet'] ``` The...
[15:02:37] <wikibugs>	 06Operations, 06TCB-Team, 10Two-Column-Edit-Conflict-Merge, 13Patch-For-Review, and 2 others: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#3042986 (10Addshore)
[15:04:10] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043007 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1038.eqiad.wmnet'] ``` The...
[15:04:19] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043008 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1042.eqiad.wmnet'] ``` The...
[15:05:09] <icinga-wm>	 PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:05:39] <icinga-wm>	 PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[15:06:25] <elukey>	 !log Increased manually maximum httpd keep alive requests and timeout on bohrium (piwik) - T154558
[15:06:30] <godog>	 !log roll-restart restbase after statsd move to graphite1001 - T157022
[15:06:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:30] <stashbot>	 T154558: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558
[15:06:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:34] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[15:06:37] <gehel>	 !log restarting kartotherian / tilerator(ui) on maps-test*
[15:06:39] <icinga-wm>	 PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:47] <godog>	 phew for a moment I thought I got the task number wrong
[15:07:03] <Zppix>	 godog:  lol
[15:07:03] <elukey>	 ahahaha
[15:07:26] <Zppix>	 godog:  nah it was just elukey's early april fools joke
[15:07:38] <godog>	 Zppix: hehe early april
[15:07:39] <icinga-wm>	 RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset 0.0002 secs
[15:08:22] <wikibugs>	 06Operations, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3043033 (10hashar) p:05Triage>03Low a:03hashar
[15:08:26] <wikibugs>	 (03PS1) 10Hashar: mediawiki-firejail: lint python scripts [puppet] - 10https://gerrit.wikimedia.org/r/338978 (https://phabricator.wikimedia.org/T158649)
[15:08:28] <wikibugs>	 (03PS1) 10Hashar: mediawiki-firejail: explicitly signal end of options [puppet] - 10https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649)
[15:08:30] <wikibugs>	 (03PS1) 10Hashar: mediawiki-firejail: quiet firejail [puppet] - 10https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649)
[15:09:33] <hashar>	 moritzm: more patches for you ^^^ :D  firejail emits a message to stderr that ends up in logstash hhvm logs :D
[15:09:55] <gehel>	 !log restarting kartotherian / tilerator(ui) on maps2*
[15:09:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:01] <moritzm>	 hashar: thanks, I'll have a look this evening
[15:11:31] <wikibugs>	 (03CR) 10Hashar: "I gave it a try on Jessie and they work.  Not sure whether those scripts are used on Trusty on which the firejail command might not suppor" [puppet] - 10https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar)
[15:11:48] <kart_>	 hashar: patch on wmf12 merged.
[15:11:58] <wikibugs>	 (03CR) 10Hashar: "I have no idea whether we are interested in catching firejail stdout which --quiet disable as well." [puppet] - 10https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar)
[15:12:15] <hashar>	 kart_: yeah pushing to mwdebug1001
[15:12:40] <gehel>	 !log restarting kartotherian / tilerator(ui) on maps1*
[15:12:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:03] <hashar>	 kart_: done. it is on mwdebug1001 now.
[15:14:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: "@Jcrespo, indeed! The socket should be enough to tweak hieradata for the labs::db roles, we'll need to check socket auth though" [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo)
[15:15:21] <kart_>	 hashar: go ahead. OK this time!
[15:15:32] <kart_>	 (ie really tested)
[15:15:49] <icinga-wm>	 RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[15:17:00] <logmsgbot>	 !log hashar@tin Synchronized php-1.29.0-wmf.12/extensions/UniversalLanguageSelector/UniversalLanguageSelector.hooks.php: Fix site picks: missing from globals (duration: 01m 00s)
[15:17:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:21] <hashar>	 kart_: done :)
[15:17:26] <hashar>	 !log European SWAT complete
[15:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:00] <logmsgbot>	 !log mobrovac@tin Started restart [mathoid/deploy@ba3217e]: Restarting for Graphite DNS switch T157022
[15:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:05] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[15:18:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] "Also FWIW this change can go ahead, prometheus will scrape prometheus-mysqld-exporter successfully. The failure to contact mysql itself is" [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo)
[15:18:25] <kart_>	 hashar: thanks!
[15:18:39] <icinga-wm>	 RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:19:23] <logmsgbot>	 !log mobrovac@tin Started restart [citoid/deploy@95df861]: Restarting for Graphite DNS switch T157022
[15:19:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:09] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disables drafts in gerrit - https://phabricator.wikimedia.org/T158656#3043080 (10Paladox)
[15:20:22] <moritzm>	 !log rolling restart of swift frontend servers to pick up openssl update
[15:20:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:32] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1034.eqiad.wmnet'] ```  Of which those **FAILED**: ``` set(['elastic1034.eqi...
[15:21:17] <logmsgbot>	 !log mobrovac@tin Started restart [cxserver/deploy@0e4ae4f]: Restarting for Graphite DNS switch T157022
[15:21:24] <mobrovac>	 kart_: fyi ^
[15:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:30] <elukey>	 !log restart eventlogging on kafka200[123] for openssl upgrades
[15:22:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:41] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disables drafts in gerrit - https://phabricator.wikimedia.org/T158656#3043111 (10Paladox) Here is the changes https://gerrit-review.googlesource.com/#/q/topic:private-changes+(status:open+OR+status:merged) that will bring support for priva...
[15:24:00] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3043112 (10Paladox)
[15:24:19] <icinga-wm>	 PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[15:24:26] <kart_>	 mobrovac: thanks. Any action from us?
[15:24:39] <icinga-wm>	 PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:26:01] <mobrovac>	 kart_: nope, just switching stats backends
[15:26:19] <icinga-wm>	 RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset -0.001388 secs
[15:26:39] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active
[15:27:35] <elukey>	 mobrovac: kafka2001 (sorry I was already working on it) has been depooled, restarted, waited a bit, repooled and checked with httpry. Everything seems good
[15:27:46] <elukey>	 I am going to do the rest on kafka100[123] first
[15:28:19] <elukey>	 err too many things, it was kafka1001 indeed
[15:28:38] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043122 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1038.eqiad.wmnet'] ```  and were **ALL** successful.
[15:29:08] <mobrovac>	 lol elukey
[15:29:18] <mobrovac>	 elukey: kk, both eventbus and cp look good
[15:29:41] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[15:30:20] <logmsgbot>	 !log mobrovac@tin Started restart [graphoid/deploy@da37386]: Restarting for Graphite DNS switch T157022
[15:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:24] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[15:31:21] <icinga-wm>	 PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[15:32:08] <elukey>	 !log correction on my previous entry: restart eventlogging on kafka100[123] for openssl upgrades
[15:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:28] <elukey>	 this is why I got confused, I've read the log and thought "snap I did the wrong thing!"
[15:32:32] <elukey>	 need coffee
[15:32:42] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043129 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1042.eqiad.wmnet'] ```  Of which those **FAILED**: ``` set(['elastic1042.eqi...
[15:33:01] <icinga-wm>	 RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[15:33:07] <elukey>	 kafka1002 done
[15:33:10] <elukey>	 all good..
[15:33:14] <elukey>	 going to finish with 1003
[15:33:21] <icinga-wm>	 RECOVERY - NTP peers on achernar is OK: NTP OK: Offset 0.000525 secs
[15:34:41] <logmsgbot>	 !log mobrovac@tin Started restart [mobileapps/deploy@cd3b897]: Restarting for Graphite DNS switch T157022
[15:34:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:06] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043172 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1033.eqiad.wmnet'] ```  and were **ALL** successful.
[15:36:19] <elukey>	 mobrovac: eqiad done :)
[15:36:39] <mobrovac>	 elukey: double \o/ as all is looking good
[15:37:00] <mobrovac>	 hm, weird how happy we are when technology works as it's supposed to
[15:37:04] <mobrovac>	 makes one wonder ...
[15:37:24] <elukey>	 mobrovac: proceeding with codfw ok?
[15:37:30] <wikibugs>	 06Operations, 06Operations-Software-Development: Keyholder accept passwordless keys - https://phabricator.wikimedia.org/T158660#3043173 (10Volans) p:05Triage>03High a:03Volans
[15:37:50] <mobrovac>	 elukey: kk
[15:38:08] <wikibugs>	 (03PS1) 10Volans: Keyholder: fix filter of passwordless keys [puppet] - 10https://gerrit.wikimedia.org/r/338984
[15:38:41] <icinga-wm>	 PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[15:39:06] <wikibugs>	 (03PS2) 10Jcrespo: prometheus-mysql-exporter: Add labsdb1005, just upgraded from precise [puppet] - 10https://gerrit.wikimedia.org/r/338970
[15:39:23] <elukey>	 !log restart jmxtrans on kafka[12]00[123] for T157022
[15:39:24] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] prometheus-mysql-exporter: Add labsdb1005, just upgraded from precise [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo)
[15:39:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:28] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[15:40:05] <godog>	 !log restart navtiming ve asset-check statsd-mw-js-deprecate on hafnium to pick up statsd.eqiad.wmnet change - T157022
[15:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:12] <elukey>	 !log restart eventlogging on kafka200[123] for openssl upgrades
[15:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:41] <icinga-wm>	 RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 0.000217 secs
[15:42:18] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Only output "changed" values if actually changed [software/conftool] - 10https://gerrit.wikimedia.org/r/338985
[15:42:20] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add explicit dependencies [WiP] [software/conftool] - 10https://gerrit.wikimedia.org/r/338986
[15:42:35] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043191 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1042.eqiad.wmnet'] ``` The...
[15:42:40] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043192 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1034.eqiad.wmnet'] ``` The...
[15:43:06] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] prometheus-mysql-exporter: Add labsdb1005, just upgraded from precise [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo)
[15:43:34] <elukey>	 mobrovac: I don't really see a lot of traffic on kafka200[123] for EL
[15:43:57] <mobrovac>	 elukey: EL?
[15:44:14] <elukey>	 eventlogging or the http service, as you want to call it :D
[15:44:41] <icinga-wm>	 PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[15:47:01] <icinga-wm>	 PROBLEM - DPKG on rhenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:47:41] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active
[15:48:01] <icinga-wm>	 RECOVERY - DPKG on rhenium is OK: All packages OK
[15:48:41] <icinga-wm>	 PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[15:48:54] * gehel having a look at wdqs1002...
[15:49:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add explicit dependencies [WiP] [software/conftool] - 10https://gerrit.wikimedia.org/r/338986 (owner: 10Giuseppe Lavagetto)
[15:50:41] <godog>	 mobrovac: scb looking good, thanks! would you have time for ores and parsoid as well? if not I'll roll-restart in ~1h after a meeting
[15:50:41] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[15:50:52] <gehel>	 !log restarting wdqs-updater on wdqs1002
[15:50:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:35] <mobrovac>	 godog: i think you better ping Amir1 for ores, I can do parsoid 2h from now
[15:51:58] <Amir1>	 hey, what do I need to do for ORES?
[15:52:07] <godog>	 mobrovac: ack thanks! I'll ping you if I don't get to do parsoid
[15:52:24] <godog>	 Amir1: hey, we'd need a simple rolling-restart for ores to pick up DNS changes for statsd.eqiad.wmnet
[15:53:31] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3570.30 Read Requests/Sec=3083.90 Write Requests/Sec=35.20 KBytes Read/Sec=28135.20 KBytes_Written/Sec=7730.00
[15:53:41] <icinga-wm>	 RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[15:54:59] <Amir1>	 godog: Okay, I'll do it now, if that's okay.
[15:58:08] <godog>	 Amir1: sure, thanks!
[16:03:45] <wikibugs>	 06Operations, 07LDAP, 13Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#3043262 (10faidon) @MoritzMuehlenhoff, any news from openldap-technical or in general about this?
[16:05:51] <icinga-wm>	 PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 2183 MB (8% inode=96%)
[16:06:51] <dcausse>	 !log truncated main log file on elastic1030
[16:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:31] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=114.40 Read Requests/Sec=193.80 Write Requests/Sec=3.20 KBytes Read/Sec=2030.00 KBytes_Written/Sec=373.60
[16:08:09] <moritzm>	 !log restarting apache on uranium for openssl update
[16:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:14] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043269 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1034.eqiad.wmnet'] ```  and were **ALL** successful.
[16:08:54] <wikibugs>	 06Operations, 07LDAP, 13Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#3043271 (10MoritzMuehlenhoff) Not yet, no.
[16:11:21] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#3043273 (10jcrespo)
[16:11:26] <wikibugs>	 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#3043272 (10jcrespo)
[16:11:28] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#3043274 (10jcrespo)
[16:15:31] <icinga-wm>	 PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[16:15:51] <icinga-wm>	 RECOVERY - Disk space on elastic1030 is OK: DISK OK
[16:17:51] <wikibugs>	 06Operations, 10Monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3043279 (10faidon) @Dzahn, what's the status of this?
[16:18:16] <dcausse>	 !log truncated main elastic log, daemon.log and syslog on elastic1023
[16:18:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:23] <wikibugs>	 (03PS1) 10Jcrespo: Remove old CA (ssl='on') and add a new option "socket" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338988 (https://phabricator.wikimedia.org/T157702)
[16:18:31] <icinga-wm>	 PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:19:31] <icinga-wm>	 RECOVERY - Disk space on elastic1023 is OK: DISK OK
[16:21:11] <icinga-wm>	 PROBLEM - Keyholder SSH agent on sarin is CRITICAL: CRITICAL: Cannot connect to keyholder-proxy socket /run/keyholder/proxy.sock.
[16:23:07] <wikibugs>	 (03PS2) 10Jcrespo: Remove old CA (ssl='on') and add a new option "socket" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338988 (https://phabricator.wikimedia.org/T157702)
[16:26:01] <icinga-wm>	 PROBLEM - DPKG on dataset1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[16:26:15] <moritzm>	 ^expected, update in progress
[16:26:23] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852
[16:27:01] <icinga-wm>	 RECOVERY - DPKG on dataset1001 is OK: All packages OK
[16:27:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: prometheus: add etcd metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/336852 (owner: 10Giuseppe Lavagetto)
[16:27:08] <_joe_>	 godog: care to re-review? ^^
[16:27:19] <_joe_>	 I'd like to take it to production
[16:27:30] <_joe_>	 tomorrow is fine ofc
[16:27:31] <_joe_>	 :)
[16:27:51] <icinga-wm>	 PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 518 MB (1% inode=96%)
[16:28:46] <jynus>	 will it be ok with so little space?
[16:29:21] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: elasticsearch logs are duplicated in journald - https://phabricator.wikimedia.org/T158664#3043290 (10Gehel)
[16:29:31] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: elasticsearch logs are duplicated in journald - https://phabricator.wikimedia.org/T158664#3043303 (10Gehel) p:05Triage>03High
[16:29:46] <moritzm>	 dcausse already truncated the log earlier in the day
[16:29:51] <jynus>	 oh
[16:29:53] <jynus>	 it is /
[16:29:58] <jynus>	 I misread it as /srv
[16:30:06] <jynus>	 not worried, then
[16:30:21] * gehel is slightly worried, but not too much :)
[16:30:47] <jynus>	 well, if logs are lost it is bad, if a service goes down it is worse :-)
[16:31:34] <gehel>	 !log truncating elasticsearch logs on elastic1030
[16:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:51] <icinga-wm>	 RECOVERY - Disk space on elastic1030 is OK: DISK OK
[16:32:12] <wikibugs>	 (03PS5) 10Ottomata: Symlink reportupdater output to published-datasets [puppet] - 10https://gerrit.wikimedia.org/r/337672 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric)
[16:32:31] <icinga-wm>	 PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[16:33:21] <wikibugs>	 (03PS6) 10Ottomata: Symlink reportupdater output to published-datasets [puppet] - 10https://gerrit.wikimedia.org/r/337672 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric)
[16:34:20] <gehel>	 !log truncating elasticsearch logs on elastic1023
[16:34:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:31] <icinga-wm>	 RECOVERY - Disk space on elastic1023 is OK: DISK OK
[16:34:51] <icinga-wm>	 PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 1478 MB (5% inode=96%)
[16:36:13] <wikibugs>	 (03PS3) 10Jcrespo: Remove old CA (ssl='on') and add a new option "socket" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338988 (https://phabricator.wikimedia.org/T157702)
[16:36:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852 (owner: 10Giuseppe Lavagetto)
[16:37:17] <gehel>	 !log restarting elasticsearch on elastic1030
[16:37:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:31] <icinga-wm>	 PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: / 1456 MB (5% inode=96%)
[16:38:41] <icinga-wm>	 RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[16:39:51] <icinga-wm>	 RECOVERY - Disk space on elastic1030 is OK: DISK OK
[16:40:31] <icinga-wm>	 RECOVERY - Disk space on elastic1023 is OK: DISK OK
[16:41:24] <gehel>	 ok, we should be good again on those logs... it is more than time to upgrade to elastic 5!
[16:42:27] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043355 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1042.eqiad.wmnet'] ``` The...
[16:42:47] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043356 (10elukey) Checked the oxygen logs and the following UA is the only one getting 503s during the past 21 days:  ```244268 "Wikipedia/10...
[16:47:42] <bd808>	 stashbot: help
[16:47:43] <stashbot>	 See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[16:48:31] <icinga-wm>	 RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[16:50:56] <wikibugs>	 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (10Ottomata) Just FYI, there is a Kafka based Monolog implementation in Mediawiki, currently used by the Discovery team for shipping some logs to Hadoop.  I betcha we could pretty easily use...
[16:52:12] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043475 (10Milimetric) Ping @Fjalapeno this UA is the iOS app, right?  Any help you can provide Luca in finding out why we might be seeing 503...
[16:52:26] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Symlink reportupdater output to published-datasets [puppet] - 10https://gerrit.wikimedia.org/r/337672 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric)
[16:55:24] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043511 (10elukey) Just adding a note: I am seeing also others similar UA, that follows the same pattern.. but nothing else. I suspect that I...
[16:55:56] <hashar>	 jouncebot: refresh
[16:55:57] <jouncebot>	 I refreshed my knowledge about deployments.
[16:56:01] <hashar>	 jouncebot: next
[16:56:01] <jouncebot>	 In 0 hour(s) and 3 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1700)
[16:56:16] <hashar>	 Puppet SWAT is empty.  We moved my patches to tomorrow morning :)
[16:59:29] <ema>	 !log cache_misc, cache_maps: libssl1.1 upgraded to 1.1.0e-1+wmf1, libevent-2.0-5 upgraded to 2.0.21-stable-2+deb8u1
[16:59:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:04] <jouncebot>	 godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1700). Please do the needful.
[17:01:49] <wikibugs>	 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: upgrade backup4001 hard disk array - https://phabricator.wikimedia.org/T157473#3043535 (10Jgreen) a:05Jgreen>03RobH Reassigning to Rob because we're stuck at a hardware problem (new HDDs appear to be incompatible with the controller/BIOS/firmware?)
[17:04:10] <godog>	 oohh puppet swat empty? https://i.redd.it/6osjlug3xugy.gif
[17:04:15] <wikibugs>	 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: upgrade backup4001 hard disk array - https://phabricator.wikimedia.org/T157473#3043539 (10RobH) So now the system is in a bad state where I cannot login to the webGUI to upgrade firmware, and its not coming back from it by racreset.  in detail: trying to log...
[17:05:01] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 19
[17:05:56] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043555 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1042.eqiad.wmnet'] ```  and were **ALL** successful.
[17:06:03] <wikibugs>	 06Operations: Expire time on 404 is too high (Wikipedia) - https://phabricator.wikimedia.org/T157214#3043556 (10Aklapper) 05Open>03declined Unfortunately closing this report as no further information has been provided. @Mjbmr: Please reopen this report (by changing its status) after you have provided the inf...
[17:07:41] <wikibugs>	 (03PS5) 10Madhuvishy: diamond: Allow providing puppet file reference to collector config file [puppet] - 10https://gerrit.wikimedia.org/r/337769
[17:08:41] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:08:58] <wikibugs>	 (03CR) 10Madhuvishy: [C: 032] diamond: Allow providing puppet file reference to collector config file [puppet] - 10https://gerrit.wikimedia.org/r/337769 (owner: 10Madhuvishy)
[17:10:07] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043573 (10Fjalapeno)
[17:10:21] <wikibugs>	 (03PS1) 10Jcrespo: Upcoming mediawiki-core hardware expansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580)
[17:10:42] <wikibugs>	 (03CR) 10Madhuvishy: [V: 032 C: 032] diamond: Allow providing puppet file reference to collector config file [puppet] - 10https://gerrit.wikimedia.org/r/337769 (owner: 10Madhuvishy)
[17:12:07] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-2] "This is not intended for commit (but please do not abandon for 1-2 years)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580) (owner: 10Jcrespo)
[17:13:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Upcoming mediawiki-core hardware expansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580) (owner: 10Jcrespo)
[17:13:53] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043580 (10Fjalapeno) @Milimetric having @joewalsh verify this for you
[17:13:54] <wikibugs>	 (03PS3) 10Madhuvishy: labstore: Read directory size diamond collector config from external file [puppet] - 10https://gerrit.wikimedia.org/r/337785
[17:14:29] <wikibugs>	 (03CR) 10Madhuvishy: [V: 032 C: 032] labstore: Read directory size diamond collector config from external file [puppet] - 10https://gerrit.wikimedia.org/r/337785 (owner: 10Madhuvishy)
[17:17:56] <godog>	 Amir1: did the ores rolling restart happen? still seeing statsd metrics towards graphite2001
[17:18:30] <Amir1>	 godog: sorry for late work, I was looking for my yubikey
[17:19:30] <godog>	 _joe_: no worries, I'm around for another hour at least
[17:19:34] <godog>	 no, that was for Amir1 
[17:19:52] <godog>	 _joe_: re https://gerrit.wikimedia.org/r/#/c/336852 I had a comment about using the default ssl vhost too
[17:20:03] <_joe_>	 sorry, meeting 
[17:20:11] <Amir1>	 !log restarting ores uwsgi and celery services in scb nodes
[17:20:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:33] <Amir1>	 1 down, 7 to go (doing codfw ones too)
[17:20:38] <godog>	 Amir1: neat, thanks!
[17:27:00] <Amir1>	 godog: eqiad nodes are done now
[17:27:06] <Amir1>	 going to codfw nodes
[17:34:36] <wikibugs>	 (03PS1) 10Gehel: WIP - elasticsearch: only send minimal logging to console [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664)
[17:35:01] <Amir1>	 !log done restarting ores services
[17:35:04] <Amir1>	 godog: ^
[17:35:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:14] <godog>	 Amir1: fantastic, thanks for your help!
[17:35:32] <Amir1>	 Thank you!
[17:35:55] <godog>	 !log roll-restart parsoid in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022
[17:36:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:01] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[17:36:41] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[17:43:41] <icinga-wm>	 PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:45:19] <wikibugs>	 06Operations, 10Domains, 10Traffic: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (10Reedy) https://github.com/wikimedia/operations-dns/blob/master/templates/wikimedia.ee  If you follow "Add a record to your domain settings (Recommended)", and provide t...
[17:46:56] <wikibugs>	 (03PS1) 10Chad: Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999
[17:47:12] <godog>	 !log roll-restart jmxtrans in codfw/eqiad on conf* to pick up statsd.eqiad.wmnet DNS changes - T157022
[17:47:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:18] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[17:48:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 (owner: 10Chad)
[17:49:09] <wikibugs>	 (03PS2) 10Chad: Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999
[17:49:15] <RainbowSprinkles>	 stupid unit test
[17:50:36] <godog>	 !log roll-restart ocg in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022
[17:50:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:54] <wikibugs>	 (03CR) 10Chad: [C: 032] Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 (owner: 10Chad)
[17:56:09] <wikibugs>	 (03Merged) 10jenkins-bot: Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 (owner: 10Chad)
[17:57:17] <wikibugs>	 (03CR) 10jenkins-bot: Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 (owner: 10Chad)
[17:57:31] <wikibugs>	 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3038649 (10CRoslof) One- and two-character .org domain names aren't available for general registration. See, for example, this press release...
[17:57:44] <logmsgbot>	 !log demon@tin Synchronized multiversion/MWMultiVersion.php: Shut up dumb invalid hostname errors (duration: 00m 52s)
[17:57:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:19] <wikibugs>	 (03CR) 10Dzahn: "why the change? you don't want it to redirect to the tools directly anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt)
[17:58:37] <logmsgbot>	 !log demon@tin Synchronized tests/multiversion/MWMultiVersionTest.php: No op in prod, completeness, etc (duration: 00m 40s)
[17:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:04] <jouncebot>	 gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1800).
[18:00:06] <wikibugs>	 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3038649 (10Dzahn) Even if we would be able to get it and wanted to use it, it would still be blocked on T133548.
[18:03:12] <Amir1>	 nothing for ores today
[18:03:50] <godog>	 !log roll-restart trendingedits in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022
[18:03:52] <wikibugs>	 (03PS1) 10Madhuvishy: labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001
[18:03:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:55] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[18:04:15] <wikibugs>	 06Operations, 10Domains, 10Traffic: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3043748 (10Reedy) Oh, and if you're wanting to use Google Apps like that.. I suspect your mail server MX records will need updating - https://github.com/wikimedia/operations-dns/b...
[18:04:21] <wikibugs>	 06Operations, 10Monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3043749 (10Dzahn) check_ipmi_sensor has been installed across the fleet but doesn't work.  running it with options for temperature makes it exit with "CRIT" for _non_-temperature things  root@lead:~# /usr/lo...
[18:04:41] <godog>	 !log roll-restart eventstreams in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022
[18:04:42] <wikibugs>	 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3038649 (10demon) Heh, I had this idea like 5 **years** ago but never felt like bothering to follow-up on it. Plus T133548
[18:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 (owner: 10Madhuvishy)
[18:05:22] <wikibugs>	 (03PS2) 10Madhuvishy: labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001
[18:06:28] <godog>	 hashar: bouncing zuul on contint1001 isn't impactful is it?
[18:06:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 (owner: 10Madhuvishy)
[18:06:51] <wikibugs>	 (03PS3) 10Madhuvishy: labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001
[18:12:03] <wikibugs>	 06Operations, 10Traffic, 10Wikimedia-Mailing-lists: convert lists.wikimedia.org certificate to LetsEncrypt (deadline:2017-03-02) - https://phabricator.wikimedia.org/T154917#3043827 (10RobH) p:05Triage>03High a:05RobH>03BBlack I'm just not getting through this fast enough, so I'm reassigning this to B...
[18:12:10] <godog>	 !log roll-restart zuul on contint1001 to pick up statsd.eqiad.wmnet DNS changes - T157022
[18:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:15] <stashbot>	 T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[18:12:41] <icinga-wm>	 RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[18:12:57] <godog>	 !log roll-restart nodepool on labnodepool1001 to pick up statsd.eqiad.wmnet DNS changes - T157022
[18:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:24] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(27|32|34|38|41).eqiad.wmnet
[18:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:16] <wikibugs>	 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3043838 (10Aklapper) @Jseddon / @MBeat33: Any news?
[18:18:45] <wikibugs>	 (03PS2) 10Volans: Keyholder: fix filter of passwordless keys [puppet] - 10https://gerrit.wikimedia.org/r/338984 (https://phabricator.wikimedia.org/T158660)
[18:20:08] <wikibugs>	 06Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3043870 (10RobH) {F5743871} is the zip of the license info.  @Papaul: Next time you need me to pull this, please assign it to me so I won't miss it.  Please update the license on the system.  While this...
[18:23:08] <wikibugs>	 (03PS1) 10Volans: Keyholder: add support for ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/339002 (https://phabricator.wikimedia.org/T158659)
[18:23:11] <wikibugs>	 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#3043881 (10RobH) The license should not expire, that is strange.  I've downloaded it from Dell's license management site:    {F5743908} - iDRAC7 Enterprise,Perpetual,Digital License only  Please up...
[18:23:23] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043886 (10Fjalapeno) @Milimetric  @elukey verified that this is the iOS app
[18:23:39] <wikibugs>	 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043892 (10Jgreen)
[18:24:17] <wikibugs>	 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043908 (10Jgreen) @EWilfong_WMF are you the right point of contact for Trilogy for this?
[18:24:40] <wikibugs>	 06Operations, 10Graphite: Improve graphite failover - https://phabricator.wikimedia.org/T88997#3043910 (10fgiunchedi)
[18:26:02] <wikibugs>	 (03PS1) 10Jcrespo: [WIP]mariadb: Include a new option "socket" for all servers [puppet] - 10https://gerrit.wikimedia.org/r/339004
[18:26:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-2] "Not ready for deploy." [puppet] - 10https://gerrit.wikimedia.org/r/339004 (owner: 10Jcrespo)
[18:26:59] <wikibugs>	 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043925 (10RobH) Please note that some potential details for this are also on private task T156849.  However, relevant info has been...
[18:28:12] <wikibugs>	 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Keyholder accept passwordless keys - https://phabricator.wikimedia.org/T158660#3043933 (10Volans) @mmodell I'm not sure what's the status with the https://phabricator.wikimedia.org/source/keyholder/ repository that was recently created.  I'...
[18:30:26] <wikibugs>	 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Keyholder accept passwordless keys - https://phabricator.wikimedia.org/T158660#3043941 (10mmodell) @volans: Thanks for the heads-up. We still use the code from puppet in prod. It will remain that way until I get the package accepted by...
[18:34:21] <icinga-wm>	 PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:34:27] <wikibugs>	 (03CR) 10Rush: [C: 031] labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 (owner: 10Madhuvishy)
[18:34:43] <wikibugs>	 (03CR) 10Madhuvishy: [C: 032] labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 (owner: 10Madhuvishy)
[18:34:59] <Pchelolo>	 !log changeprop deploy 4706f9da
[18:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:49] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@4706f9d]: Change-Prop: Make ORES return minified responses T157693
[18:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:54] <stashbot>	 T157693: Use minified JSON format in ChangeProp - https://phabricator.wikimedia.org/T157693
[18:36:44] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@4706f9d]: Change-Prop: Make ORES return minified responses T157693 (duration: 00m 55s)
[18:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:24] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043971 (10JoeWalsh) @Milimetric this UA is from the iOS app. In testing locally, I didn't see any 503s. A potential cause of the surge...
[18:45:11] <wikibugs>	 (03PS1) 10Chad: group0 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005
[18:45:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]mariadb: Include a new option "socket" for all servers [puppet] - 10https://gerrit.wikimedia.org/r/339004 (owner: 10Jcrespo)
[18:45:23] <wikibugs>	 (03CR) 10Chad: [C: 04-2] "For l8r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005 (owner: 10Chad)
[18:45:58] <moritzm>	 !log installing PHP security updates on iridium (phabricator.wikimedia.org)
[18:46:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:13] <wikibugs>	 (03CR) 10Tim Landscheidt: "@Dzahn: The change should be a no-op for the redirects (at least that is my intention).  I just want to use the same syntax for all redire" [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt)
[18:49:31] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2866.70 Read Requests/Sec=5072.80 Write Requests/Sec=7.80 KBytes Read/Sec=23726.80 KBytes_Written/Sec=234.00
[18:49:53] <logmsgbot>	 !log demon@tin Started scap: prime wmf.13 - testwiki plus l10n build
[18:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:31] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4671.60 Read Requests/Sec=2119.00 Write Requests/Sec=32.80 KBytes Read/Sec=18994.40 KBytes_Written/Sec=5586.80
[18:56:11] <icinga-wm>	 RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys.
[18:58:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 04-1] "Not sure if that's really desirable. I agree the specific log message is superfluous, but firejail doesn't have a concept of log verbosity" [puppet] - 10https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar)
[19:01:16] <RainbowSprinkles>	 Dangit, scap didn't pick up my testwiki to wmf.13
[19:01:25] <RainbowSprinkles>	 So didn't build l10n :(
[19:02:21] <icinga-wm>	 RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[19:02:31] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=247.50 Read Requests/Sec=306.70 Write Requests/Sec=74.30 KBytes Read/Sec=8127.60 KBytes_Written/Sec=751.60
[19:03:54] <RainbowSprinkles>	 -    "testwiki": "php-1.29.0-wmf.12",
[19:03:55] <RainbowSprinkles>	 +    "testwiki": "php-1.29.0-wmf.13",
[19:03:56] <RainbowSprinkles>	 Why not?
[19:03:56] <RainbowSprinkles>	 stupid scap
[19:04:22] <wikibugs>	 (03PS1) 10Madhuvishy: labstore: Change prefix depth and byteunit config for dir size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339006
[19:04:28] <thcipriani>	 RainbowSprinkles: ah crap :( This is https://phabricator.wikimedia.org/T156851
[19:04:45] <RainbowSprinkles>	 Ahhh
[19:04:50] <RainbowSprinkles>	 I forgot that bug
[19:04:52] <RainbowSprinkles>	 Hmm
[19:04:56] <wikibugs>	 (03PS1) 10Rush: wip: tools: allow generic banner for inf protection [puppet] - 10https://gerrit.wikimedia.org/r/339007
[19:05:10] <thcipriani>	 I have a fix. The workaround for now is to run a scap pull on tin before the sync
[19:05:12] <thcipriani>	 sorry :(
[19:05:30] <RainbowSprinkles>	 Ah, I'll let this sync finish first so the files go out everywhere
[19:05:32] <RainbowSprinkles>	 Then do that
[19:05:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wip: tools: allow generic banner for inf protection [puppet] - 10https://gerrit.wikimedia.org/r/339007 (owner: 10Rush)
[19:06:02] <thcipriani>	 if you sync again it'll just work
[19:06:22] <thcipriani>	 just make sure that /srv/mediawiki/wikiversions.json is correct, but I think it's just an order of operations thing.
[19:06:50] * RainbowSprinkles nods
[19:06:51] <RainbowSprinkles>	 Thx
[19:06:54] <thcipriani>	 totally my fault in the 3.5.x release :(
[19:07:09] <Reedy>	 Feels like dejavu ;P
[19:07:48] <thcipriani>	 heh, there's some fun scap history on that task. I'm retravelling well worn paths it seems.
[19:07:55] <wikibugs>	 (03CR) 10Rush: openstack: nova_fullstack_test changes to daemonize (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337598 (owner: 10Rush)
[19:08:10] <wikibugs>	 (03PS11) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598
[19:09:48] <wikibugs>	 (03PS12) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598
[19:10:14] <wikibugs>	 (03CR) 10Madhuvishy: [C: 032] labstore: Change prefix depth and byteunit config for dir size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339006 (owner: 10Madhuvishy)
[19:14:21] <icinga-wm>	 PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:14:54] <paladox>	 moritzm hi, upstream are doing the systemd file it seems https://gerrit-review.googlesource.com/#/c/89893/ :)
[19:15:11] <paladox>	 (gerrit)
[19:15:19] <moritzm>	 nice!
[19:15:51] <paladox>	 yep, will be available in gerrit 2.13.6 according to https://groups.google.com/forum/#!topic/repo-discuss/SL_lXZDDG_g
[19:16:08] <logmsgbot>	 !log demon@tin Finished scap: prime wmf.13 - testwiki plus l10n build (duration: 26m 15s)
[19:16:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:12] <logmsgbot>	 !log demon@tin Started scap: prime wmf.13 - testwiki plus l10n build (pt 2 because T156851)
[19:17:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:17] <stashbot>	 T156851: scap wikiversions compile happening too late in scap sync - https://phabricator.wikimedia.org/T156851
[19:18:11] <wikibugs>	 06Operations, 10ops-eqiad, 10hardware-requests: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3044110 (10RobH) a:03Cmjohnson
[19:18:21] <icinga-wm>	 RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[19:18:34] <RainbowSprinkles>	 thcipriani: Indeed, working now
[19:20:01] <RainbowSprinkles>	 thcipriani: Heh, completing the sync then doing a second? Results in a bajillion File not found: /srv/mediawiki/php-1.29.0-wmf.13/../wmf-config/ExtensionMessages-1.29.0-wmf.13.php in /srv/mediawiki/wmf-config/CommonSettings.php on line 3416
[19:20:33] <paladox>	 moritzm seems the change also auto starts gerrit when a request on 80 is received. I think that's if someone goes to gerrit.wikimedia.org and it's down, it will auto start, though I may have misread it or something.
[19:20:39] <paladox>	 mutante ^^ :)
[19:20:52] <paladox>	 "and a corresponding gerrit.service file enables an automatic start of gerrit
[19:20:52] <paladox>	 on the first request on port 80."
[19:21:05] <RainbowSprinkles>	 Well that seems like bizarre behavior
[19:21:12] <RainbowSprinkles>	 Why would you want it to be off until port 80 is hit?
[19:21:13] <RainbowSprinkles>	 :)
[19:21:41] <RainbowSprinkles>	 Or, if you want to take it down for maintenance, have it come up automatically because someone tries hitting it :)
[19:21:53] <RainbowSprinkles>	 (Also, their systemd file looks mostly useless for us, we don't serve over :80
[19:22:13] <RainbowSprinkles>	 We serve over 8080 which is proxied to 443 for users)
[19:23:36] <paladox>	 RainbowSprinkles not sure.
[19:23:59] <paladox>	 we will want to customise the file since they are putting it on this path /opt/gerritsrv/
[19:24:07] <paladox>	 which isn't /var/lib/gerrit2.
[19:24:19] <RainbowSprinkles>	 We can just write our own :)
[19:24:25] <paladox>	 Already done that
[19:24:29] <RainbowSprinkles>	 There's no point in using theirs, it's a dumb stub :)
[19:24:37] <RainbowSprinkles>	 I know, we should just keep our own
[19:24:40] <paladox>	 RainbowSprinkles https://gerrit.wikimedia.org/r/#/c/333475/
[19:24:41] <paladox>	 ok
[19:25:11] <paladox>	 that one works better in my testing than the init.d one.
[19:25:44] <RainbowSprinkles>	 It's still ultimately the same script :)
[19:25:53] <RainbowSprinkles>	 Our init.d was just copied from ./bin/gerrit.sh
[19:25:54] <RainbowSprinkles>	 :)
[19:26:10] <paladox>	 Yep, but just never worked when you ran the script as root
[19:26:19] <paladox>	 i mean sudo service gerrit start or stop
[19:26:23] <RainbowSprinkles>	 It worked in prod ;-)
[19:26:29] <RainbowSprinkles>	 Otherwise I would've been worried
[19:26:34] <paladox>	 oh
[19:27:10] <paladox>	 how did you manage to get it working? Is the pid run as root? The pid for me keeps getting set as root. but with systemd it is gerrit2.
[19:27:35] <RainbowSprinkles>	 It's gerrit2
[19:28:07] <paladox>	 yep
[19:32:13] <logmsgbot>	 !log demon@tin scap failed: RuntimeError 2 test canaries had check failures (rerun with --force to override this check) (duration: 15m 00s)
[19:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:44] <logmsgbot>	 !log demon@tin Started scap: prime wmf.13 - testwiki plus l10n build (pt 3 because ugh)
[19:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:58] <paladox>	 RainbowSprinkles and moritzm i will just try to upstream the version i created of systemd for gerrit.
[19:33:06] <RainbowSprinkles>	 thcipriani: Heh, upside is that canary checks did their job. Downside is I'm trying to *fix* that breakage ;-)
[19:33:14] <paladox>	 They can have two versions of the file as you can write it in different ways in systemd :)
[19:33:22] <thcipriani>	 RainbowSprinkles: :(
[19:36:48] <RainbowSprinkles>	 wtf...?
[19:36:59] <RainbowSprinkles>	  /wiki/Special:Version MWException from line 481 of /srv/mediawiki/php-1.29.0-wmf.13/includes/cache/localisation/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php.
[19:37:57] <thcipriani>	 ugh, what is happening? Is scap not rebuilding l10n?
[19:38:18] <RainbowSprinkles>	 Claimed it did
[19:39:20] <RainbowSprinkles>	 Let's let the current scap finish so the "can't find ExtensionMessages" bit goes away
[19:39:24] <RainbowSprinkles>	 Then we'll figure out why no en language
[19:39:56] <thcipriani>	 didn't rebuild the cdbs...
[19:40:08] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "one leftover from debugging, see inline" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto)
[19:40:17] <thcipriani>	 well, at least so far, just have the upstream/*.{json,md5} files
[19:40:50] <RainbowSprinkles>	 Ah, I figured it out
[19:40:53] <thcipriani>	 should do that as the last step of a scap, I guess...
[19:41:00] <RainbowSprinkles>	 scap-rebuild-cdbs didn't do anything the first time
[19:41:01] <icinga-wm>	 PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:41:04] <RainbowSprinkles>	 Because of the bug
[19:41:28] <thcipriani>	 ah, right, since it is not an active wikiversion the first time around.
[19:41:30] <RainbowSprinkles>	 But we've swapped to wmf.13, so testwiki is expecting the cache files
[19:41:41] <RainbowSprinkles>	 It *should* clean up as this scap finishes
[19:42:16] <RainbowSprinkles>	 thcipriani: This bug is nasty :(
[19:42:24] <thcipriani>	 hrm...update wikiversions should double-check the cdb files as well. I know we do that in several places.
[19:43:43] <thcipriani>	 indeed. it's ugly. and it's a tangly mess.
[19:46:35] <paladox>	 RainbowSprinkles did you notice the extra emails gerrit sends out now.
[19:46:40] <RainbowSprinkles>	 No
[19:46:54] <RainbowSprinkles>	 thcipriani: scap-rebuild-cdbs is fixing it this time
[19:46:58] <RainbowSprinkles>	 Error rate going down
[19:47:17] <paladox>	 Well i have a ton even from changes that have not changed in a while. Makes it hard to look at newer changes needing reviews.
[19:47:32] <paladox>	 But upstream have acknowledged the bug and have a fix for the rest api
[19:47:41] <paladox>	 but not for the ui, gwt and polygerrit
[19:48:02] <RainbowSprinkles>	 I disable most e-mails anyway :)
[19:48:06] <paladox>	 oh
[19:48:15] <paladox>	 the bug only happened in 2.13.
[19:48:36] <RainbowSprinkles>	 2.13 is a terrible release!
[19:48:57] <paladox>	 Yeh, 2.14 will be a great release :)
[19:49:10] <paladox>	 Hopefully less buggy
[19:49:11] <RainbowSprinkles>	 We'll wait until at least 2.14.1 or .2
[19:49:14] <RainbowSprinkles>	 I don't trust .0 
[19:49:15] <RainbowSprinkles>	 :D
[19:49:18] <paladox>	 yep
[19:49:32] <Reedy>	 RainbowSprinkles: You will love it. So awesome.
[19:49:42] <RainbowSprinkles>	 Maybe gerrit will be great again
[19:49:47] <paladox>	 lol
[19:49:48] <RainbowSprinkles>	 Like 2.8.forever
[19:49:53] <RainbowSprinkles>	 2.8.x was a great release
[19:49:55] <RainbowSprinkles>	 The last good release
[19:49:56] <RainbowSprinkles>	 :p
[19:50:02] <logmsgbot>	 !log demon@tin Finished scap: prime wmf.13 - testwiki plus l10n build (pt 3 because ugh) (duration: 17m 17s)
[19:50:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:14] <paladox>	 RainbowSprinkles what did you like about 2.8?
[19:50:19] <RainbowSprinkles>	 It was stable
[19:50:24] <RainbowSprinkles>	 So. Freaking. Stable.
[19:50:28] <paladox>	 oh ah.
[19:50:35] <RainbowSprinkles>	 Bugs? Sure. But they were UI quirks and minor things
[19:51:11] <paladox>	 RainbowSprinkles i think everything will be unstable as upstream are going with NoteDb and don't allow access to rest api
[19:51:34] <paladox>	 so no one really experiences the bugs there so that may be why everything is buggy.
[19:52:06] <RainbowSprinkles>	 thcipriani: All better now.
[19:52:08] <paladox>	 RainbowSprinkles my patch for allowing owners to delete their own changes was merged :)
[19:52:09] <RainbowSprinkles>	 That was...annoying
[19:52:45] <thcipriani>	 agreed. will have a fix whenever I can get testing to work.
[19:52:57] <paladox>	 The component they're building polygerrit on is available for gwt. Polymer for gwt.
[19:53:40] <paladox>	 Reedy the 2.13.6 release will help us prevent problems like https://phabricator.wikimedia.org/T153079
[19:53:51] <paladox>	 includes a couple of fixes for submodules :)
[20:00:05] <jouncebot>	 RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T2000). Please do the needful.
[20:00:40] <wikibugs>	 (03PS1) 10Gergő Tisza: Fix Sentry URL scheme on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339012
[20:01:51] <RainbowSprinkles>	 jouncebot: get with the program, I already am
[20:03:53] <wikibugs>	 (03PS1) 10Madhuvishy: labstore: Log directory size collector size in bytes [puppet] - 10https://gerrit.wikimedia.org/r/339013
[20:04:16] <wikibugs>	 (03CR) 10Chad: [C: 032] group0 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005 (owner: 10Chad)
[20:07:23] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005 (owner: 10Chad)
[20:07:31] <wikibugs>	 (03CR) 10jenkins-bot: group0 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005 (owner: 10Chad)
[20:08:18] <logmsgbot>	 !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.13
[20:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:28] <urandom>	 cmjohnson1: is there any ETA on https://phabricator.wikimedia.org/T157425?
[20:08:39] <urandom>	 cmjohnson1: anything would be useful to plan by
[20:08:53] <urandom>	 cmjohnson1: including "when hell freezes over"
[20:08:54] <urandom>	 :)
[20:09:01] <icinga-wm>	 PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:09:03] <wikibugs>	 (03CR) 10Madhuvishy: [V: 032 C: 032] labstore: Log directory size collector size in bytes [puppet] - 10https://gerrit.wikimedia.org/r/339013 (owner: 10Madhuvishy)
[20:09:08] <wikibugs>	 (03PS1) 10BBlack: LE: allow non-root key ownership/perms [puppet] - 10https://gerrit.wikimedia.org/r/339015 (https://phabricator.wikimedia.org/T154917)
[20:09:10] <wikibugs>	 (03PS1) 10BBlack: lists: use LE cert for exim [puppet] - 10https://gerrit.wikimedia.org/r/339016 (https://phabricator.wikimedia.org/T154917)
[20:09:14] <RainbowSprinkles>	 urandom: Between now and the heat death of the universe? ;-)
[20:09:49] <urandom>	 RainbowSprinkles: sure, sure
[20:10:01] <icinga-wm>	 RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:10:07] <urandom>	 though, you know, anything to narrow that down...
[20:11:12] <RainbowSprinkles>	 thcipriani: group0 appears quiet, minus that spike due to cache fun times.
[20:11:18] <RainbowSprinkles>	 So, success?
[20:11:45] <RainbowSprinkles>	 (just the usual culprit of redis)
[20:12:49] <thcipriani>	 well. A success in terms of the new version anyway.
[20:12:59] <thcipriani>	 (new version of mediawiki)
[20:14:24] <wikibugs>	 (03PS1) 10Gehel: elasticsearch - reimage elastic10(35|39|43|44) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/339017 (https://phabricator.wikimedia.org/T151326)
[20:15:49] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(35|39|43|44).eqiad.wmnet
[20:15:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:19] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 031] "Looking forward to seeing this in action." [puppet] - 10https://gerrit.wikimedia.org/r/337598 (owner: 10Rush)
[20:17:21] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044385 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1035.eqiad.wmnet'] ``` The...
[20:17:22] <wikibugs>	 (03CR) 10Paladox: [C: 031] "We can now do this as the fix for ipv6 was rolled out on our install." [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4)
[20:17:30] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044386 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1039.eqiad.wmnet'] ``` The...
[20:17:37] <wikibugs>	 (03PS2) 10Paladox: phabricator: enable vcs and web user to run `git` and `ssh` via sudo [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4)
[20:17:45] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044387 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1044.eqiad.wmnet'] ``` The...
[20:17:57] <wikibugs>	 (03CR) 10Paladox: [C: 031] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/338302 (owner: 1020after4)
[20:18:01] <wikibugs>	 (03PS1) 10Thcipriani: scap prep: fix subprocess calls for master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339018
[20:18:03] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044388 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1043.eqiad.wmnet'] ``` The...
[20:19:11] <wikibugs>	 (03CR) 10Paladox: [C: 031] "@Dzahn this is needed, otherwise the file doesn't not get correctly created. IE some erb syntax is left in the file. Tested this fix local" [puppet] - 10https://gerrit.wikimedia.org/r/338302 (owner: 1020after4)
[20:20:41] <icinga-wm>	 PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:22:41] <icinga-wm>	 PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:23:48] <wikibugs>	 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3044424 (10EWilfong_WMF) @Jgreen Yes, I will be the point of contact for this update.  This domain is hosted using Azure's App Servic...
[20:24:21] <icinga-wm>	 PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:25:09] <wikibugs>	 (03PS1) 10Dzahn: install: allow rsync of /home from carbon to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/339019 (https://phabricator.wikimedia.org/T158020)
[20:26:37] <wikibugs>	 (03PS2) 10Dzahn: install: allow rsync of /home from carbon to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/339019 (https://phabricator.wikimedia.org/T158020)
[20:28:21] <icinga-wm>	 RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[20:31:08] <wikibugs>	 (03CR) 10Dzahn: [C: 032] install: allow rsync of /home from carbon to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/339019 (https://phabricator.wikimedia.org/T158020) (owner: 10Dzahn)
[20:31:32] <wikibugs>	 (03PS1) 10Dzahn: install: remove carbon from puppet [puppet] - 10https://gerrit.wikimedia.org/r/339021 (https://phabricator.wikimedia.org/T158020)
[20:36:33] <mutante>	 !log rsyncing /home/ dirs excl. dot files, from carbon to install1002 (T158020)
[20:36:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:39] <stashbot>	 T158020: decom carbon - https://phabricator.wikimedia.org/T158020
[20:37:01] <icinga-wm>	 RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[20:37:41] <icinga-wm>	 RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:39:16] <wikibugs>	 (03PS1) 10Dzahn: install: set install1002 as primary install again [puppet] - 10https://gerrit.wikimedia.org/r/339023 (https://phabricator.wikimedia.org/T158020)
[20:39:19] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044511 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1043.eqiad.wmnet'] ```  Of which those **FAILED**: ``` set(['elastic1043.eqi...
[20:42:15] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044520 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1039.eqiad.wmnet'] ```  and were **ALL** successful.
[20:44:27] <mutante>	 !log carbon - backup /root data to install1002:/root/root-carbon/ before shutdown (T158020)
[20:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:31] <stashbot>	 T158020: decom carbon - https://phabricator.wikimedia.org/T158020
[20:45:19] <wikibugs>	 (03CR) 10Chad: [C: 032] scap prep: fix subprocess calls for master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339018 (owner: 10Thcipriani)
[20:45:45] <wikibugs>	 (03PS2) 10Dzahn: install: remove carbon from puppet [puppet] - 10https://gerrit.wikimedia.org/r/339021 (https://phabricator.wikimedia.org/T158020)
[20:47:10] <wikibugs>	 (03Merged) 10jenkins-bot: scap prep: fix subprocess calls for master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339018 (owner: 10Thcipriani)
[20:47:26] <wikibugs>	 (03CR) 10jenkins-bot: scap prep: fix subprocess calls for master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339018 (owner: 10Thcipriani)
[20:47:40] <Krinkle>	 !log (terbium) sql --write testwiki 'DELETE FROM module_deps' (per T158105)
[20:47:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:44] <stashbot>	 T158105: "PHP Warning: filemtime(): No such file or directory" about files removed over a year ago - https://phabricator.wikimedia.org/T158105
[20:48:22] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044539 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1044.eqiad.wmnet'] ```  and were **ALL** successful.
[20:48:37] <Krinkle>	 !log (terbium) sql --write test2wiki 'DELETE FROM module_deps' (3687 rows affected, 0.01 sec) - per T158105.
[20:48:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:11] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044542 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1035.eqiad.wmnet'] ```  and were **ALL** successful.
[20:49:53] <wikibugs>	 (03CR) 10Dzahn: [C: 032] install: remove carbon from puppet [puppet] - 10https://gerrit.wikimedia.org/r/339021 (https://phabricator.wikimedia.org/T158020) (owner: 10Dzahn)
[20:50:31] <wikibugs>	 (03PS2) 10Dzahn: install: set install1002 as primary install again [puppet] - 10https://gerrit.wikimedia.org/r/339023 (https://phabricator.wikimedia.org/T158020)
[20:51:41] <icinga-wm>	 RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[20:52:16] <wikibugs>	 06Operations, 13Patch-For-Review: decom carbon - https://phabricator.wikimedia.org/T158020#3044558 (10Dzahn)
[20:59:57] <wikibugs>	 (03CR) 10Dzahn: [C: 032] install: set install1002 as primary install again [puppet] - 10https://gerrit.wikimedia.org/r/339023 (https://phabricator.wikimedia.org/T158020) (owner: 10Dzahn)
[21:03:08] <wikibugs>	 (03CR) 10Dzahn: "i don't think it's a no-op for the redirects. before you get a redirect to $1 so each tool, after you redirect everything to the overview " [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt)
[21:06:19] <wikibugs>	 (03CR) 10Tim Landscheidt: "JFTR: Didn't look further into my claim of a failure on Trusty; it works where it should work, and if it does not somewhere that would mak" [puppet] - 10https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk)
[21:08:06] <icinga-wm>	 PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:10:04] <logmsgbot>	 !log demon@tin Synchronized scap/plugins/prep.py: Completeness (duration: 00m 42s)
[21:10:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:09] <RainbowSprinkles>	 thcipriani: You're live ^
[21:10:43] <thcipriani>	 RainbowSprinkles: whoa boy. Thanks :)
[21:13:29] <wikibugs>	 (03CR) 10Tim Landscheidt: "(I meant "no-op" = "no change to the previous behaviour", but I think we both do :-).)" [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt)
[21:14:01] <wikibugs>	 (03PS1) 10Chad: clean.py: Remove useless underscore from method name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339029
[21:20:49] <wikibugs>	 (03CR) 10Mobrovac: [C: 031] Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry)
[21:21:04] <wikibugs>	 (03PS3) 10Chad: Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313)
[21:21:06] <wikibugs>	 (03PS1) 10Chad: clean.py: Rework command execution, reduce code dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032
[21:21:23] <wikibugs>	 (03PS2) 10Papaul: Add mgmt and production DNS for ms-be2028-ms-be2039 [dns] - 10https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337)
[21:22:22] <wikibugs>	 (03PS2) 10Chad: clean.py: Rework command execution, reduce code dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032
[21:22:24] <wikibugs>	 (03PS4) 10Chad: Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313)
[21:22:45] <wikibugs>	 06Operations, 10ops-codfw: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3044646 (10Papaul)
[21:24:32] <wikibugs>	 (03CR) 10Mobrovac: [C: 031] "PCC looks good as expected - https://puppet-compiler.wmflabs.org/5519/" [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry)
[21:26:43] <wikibugs>	 (03CR) 10Volans: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry)
[21:30:39] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=321.70 Read Requests/Sec=1563.30 Write Requests/Sec=6.10 KBytes Read/Sec=31028.80 KBytes_Written/Sec=130.40
[21:30:40] <wikibugs>	 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3044667 (10Pchelolo) So, we've discussed this on the team meeting and decided to move forward on this.  The final question is whether to use syslog-over-udp or normal file logging? We...
[21:31:26] <mutante>	 !log carbon - puppet node clean, node deactivate (T158020)
[21:31:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:33] <stashbot>	 T158020: decom carbon - https://phabricator.wikimedia.org/T158020
[21:33:44] <wikibugs>	 (03PS1) 10Chad: clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035
[21:35:57] <cmjohnson1>	 urandom: I was waiting on the spare disks that did arrive. I will swap it out in the morning
[21:36:19] <icinga-wm>	 RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[21:36:23] <wikibugs>	 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3044690 (10GWicke) @Pchelolo, logging directly to a file is synchronous, which is bad for performance & can cause outages. See the earlier discussion for an example of such an outage....
[21:36:43] <urandom>	 cmjohnson1: awesome; thanks!
[21:39:45] <wikibugs>	 06Operations, 10ops-eqiad, 06Services (watching): Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3044700 (10Eevans) To summarize from IRC today:  ```lang=irc 15:08 < urandom> cmjohnson1: is there any ETA on https://phabricator.wikimedia.org/T157425? ... 16:35 < cmjohnson1> uran...
[21:40:22] <wikibugs>	 (03PS1) 10Gergő Tisza: Fix PageViewInfo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339041 (https://phabricator.wikimedia.org/T158698)
[21:43:13] <wikibugs>	 (03PS3) 10Gergő Tisza: Fix SiteConfiguration array merge syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336747 (https://phabricator.wikimedia.org/T157656)
[21:43:39] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=116.70 Read Requests/Sec=194.00 Write Requests/Sec=6.30 KBytes Read/Sec=1991.60 KBytes_Written/Sec=297.20
[21:45:03] <wikibugs>	 06Operations, 13Patch-For-Review: decom carbon - https://phabricator.wikimedia.org/T158020#3044742 (10Dzahn)
[21:49:20] <wikibugs>	 (03CR) 10BryanDavis: [C: 04-1] "A couple of small nits inline." (032 comments) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix)
[21:50:57] <wikibugs>	 (03PS4) 10Zppix: Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700
[21:51:11] <wikibugs>	 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3044773 (10Pchelolo) >>! In T112648#3044690, @GWicke wrote: > @Pchelolo, logging directly to a file is synchronous, which is bad for performance & can cause outages. See the earlier d...
[21:51:16] <wikibugs>	 (03PS5) 10Zppix: Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700
[21:53:39] <icinga-wm>	 PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:54:07] <wikibugs>	 (03PS13) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598
[21:55:03] <wikibugs>	 (03CR) 10Rush: [V: 032 C: 032] openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 (owner: 10Rush)
[21:56:21] <wikibugs>	 (03CR) 10Zppix: [C: 031] Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix)
[21:56:52] <wikibugs>	 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3044783 (10Eevans)
[21:57:55] <wikibugs>	 (03PS8) 10Gergő Tisza: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (owner: 10Anomie)
[21:58:28] <wikibugs>	 (03CR) 10Dzahn: [C: 031] Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix)
[21:58:40] <wikibugs>	 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3044786 (10GWicke) @Pchelolo, in the disk-full situation, writing directly to files would still cause memory to build up & the service to run out of memory.
[21:59:48] <wikibugs>	 (03PS1) 10Dzahn: remove carbon's production IPs [dns] - 10https://gerrit.wikimedia.org/r/339063 (https://phabricator.wikimedia.org/T158020)
[22:01:22] <mutante>	 !log carbon - removed from icinga,  shutdown -h now  (T158020)
[22:01:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:28] <stashbot>	 T158020: decom carbon - https://phabricator.wikimedia.org/T158020
[22:01:41] <icinga-wm>	 RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[22:03:09] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] remove carbon's production IPs [dns] - 10https://gerrit.wikimedia.org/r/339063 (https://phabricator.wikimedia.org/T158020) (owner: 10Dzahn)
[22:04:25] <paladox>	 Hi, im seeing a restricted panel at https://phabricator.wikimedia.org
[22:04:26] <paladox>	 Missing or Restricted Panel	
[22:04:27] <paladox>	 This panel does not exist, or you do not have permission to see it.
[22:04:31] <icinga-wm>	 PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:04:33] <paladox>	 What is the panel?
[22:04:41] <paladox>	 twentyafterfour ^^
[22:04:42] <Zppix>	 paladox: what dashboard you use
[22:04:45] <Zppix>	 Cause mine is fine
[22:04:49] <paladox>	 Zppix im using the default
[22:04:54] <paladox>	 i haven't changed mine.
[22:05:00] <Zppix>	 Try using my dashboard it works fine
[22:05:13] <paladox>	 Oh, nope i like the default one :)
[22:05:14] <twentyafterfour>	 ?
[22:05:24] <paladox>	 I will take a screenshot
[22:05:27] <Zppix>	 paladox: i meant to test your permission
[22:06:50] <paladox>	 Zppix twentyafterfour https://phabricator.wikimedia.org/F5747915
[22:06:57] <paladox>	 it is near to the bottom
[22:06:59] <wikibugs_>	 06Operations, 13Patch-For-Review: decom carbon - https://phabricator.wikimedia.org/T158020#3044796 (10Dzahn)
[22:07:09] <wikibugs_>	 06Operations: decom carbon - https://phabricator.wikimedia.org/T158020#3023767 (10Dzahn)
[22:07:15] <paladox>	 Under Activity Feed
[22:07:23] <Zppix>	 Have you tried relogginf
[22:07:29] <Zppix>	 Relogging*
[22:08:24] <wikibugs_>	 (03PS9) 10Gergő Tisza: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: 10Anomie)
[22:08:29] <paladox>	 Zppix that won't help as one of the admins changed the dashboard.
[22:09:07] <wikibugs_>	 (03CR) 10Gergő Tisza: [C: 031] Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: 10Anomie)
[22:09:25] <Zppix>	 I cant see it im only a user
[22:09:39] <Zppix>	 So maybe twentyafterfour could
[22:09:45] <paladox>	 Zppix if you're using a different dashboard to me, you won't be able to see it.
[22:10:07] <Zppix>	 I switched to default to look at it
[22:10:11] <Zppix>	 Same issue
[22:10:11] <wikibugs_>	 06Operations: decom carbon - https://phabricator.wikimedia.org/T158020#3044807 (10Dzahn) a:05Dzahn>03RobH Hi @Robh see the check boxes above. could you disable the switch port and then hand over? Thanks!
[22:10:19] <wikibugs_>	 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3044810 (10Eevans) > Another problem that we have in repository management is the problem that a component can only contain one version of a binary package. That's problematic for long-term migrations, e.g. wh...
[22:11:58] <wikibugs_>	 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3044819 (10Dzahn)
[22:12:10] <wikibugs_>	 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2076659 (10Dzahn) carbon is down: count: 5
[22:12:22] <twentyafterfour>	 I can't remove it
[22:12:26] <twentyafterfour>	 it's not a real panel
[22:12:28] <twentyafterfour>	 it's a glitch
[22:13:27] <Zppix>	 Well thats good
[22:13:36] <Zppix>	 Is it a cache issue
[22:13:45] <wikibugs_>	 (03PS1) 10Rush: nova: run fullstack test suite on current labnet [puppet] - 10https://gerrit.wikimedia.org/r/339064
[22:14:32] <wikibugs_>	 (03PS2) 10Rush: nova: run fullstack test suite on current labnet [puppet] - 10https://gerrit.wikimedia.org/r/339064
[22:14:57] <wikibugs_>	 (03CR) 10BryanDavis: [C: 032] Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix)
[22:15:35] <paladox>	 twentyafterfour oh
[22:15:40] <paladox>	 thanks for fixing it
[22:16:09] <wikibugs_>	 (03Merged) 10jenkins-bot: Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix)
[22:17:08] <Zppix>	 bd808: thanks now the github url will no longer haunt me
[22:17:36] <wikibugs_>	 (03CR) 10Thcipriani: [C: 04-1] "I have questions about cleaning l10nupdate files in the staging directories." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 (owner: 10Chad)
[22:17:39] <bd808>	 that's a pretty bizarre thing to be haunted by ;)
[22:17:42] <wikibugs_>	 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3044841 (10Dzahn)
[22:18:02] * bd808 is scared of nuclear winter and fast moving zombies
[22:18:09] <Zppix>	 bd808: well i hate github with a passion, which is why i've merged every tool i've made to gerrit
[22:18:26] <wikibugs_>	 (03PS4) 10Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/326892
[22:18:32] <bd808>	 huh. I actually really like github's product
[22:18:42] <wikibugs_>	 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2088494 (10Dzahn)
[22:18:59] <paladox>	 Zppix how can you hate github? gerrit and github are good including differential.
[22:19:05] <paladox>	 differential and diffusion
[22:19:05] <Zppix>	 I like the idea but the execution is crap... anyway thats not why were here
[22:19:20] <Zppix>	 paladox: just gerrit has friendly ui in my opinion
[22:19:42] <paladox>	 Zppix oh. Well it's about to get friendlier to mobile users.
[22:19:50] <bd808>	 !bash <    Zppix> paladox: just gerrit has friendly ui in my opinion
[22:19:50] <stashbot>	 bd808: Stored quip at https://tools.wmflabs.org/bash/quip/AVpiwmEwQMK9DA-FKimk
[22:20:11] <Zppix>	 Lol why was that bashed lol
[22:20:18] <bd808>	 https://en.wikipedia.org/wiki/Stockholm_syndrome
[22:20:30] <paladox>	 bd808 what about polygerrit? That is way friendlier on desktop screens and mobiles.
[22:20:50] <bd808>	 the day I care about code review on mobile...
[22:20:52] <Zppix>	 paladox: what i want is gerrit app for wmf
[22:21:19] <twentyafterfour>	 lol /me is afraid of gerrit
[22:21:43] <Zppix>	 twentyafterfour: then dont join #wikimedia-releng
[22:21:44] <paladox>	 Zppix theres already a gerrit app for android but not wmf branded.
[22:21:47] <twentyafterfour>	 code review on mobile seems pretty weird to me
[22:21:49] <paladox>	 https://play.google.com/store/apps/details?id=com.ruesga.rview&hl=en_GB
[22:22:00] <Zppix>	 paladox: sorry unless ios is the new android :/
[22:22:08] <paladox>	 Zppix i never use android
[22:22:12] * paladox hates android
[22:22:15] <twentyafterfour>	 lol
[22:22:17] <Zppix>	 I love my iphone
[22:22:26] <twentyafterfour>	 so many opinions, so little time...
[22:22:59] <paladox>	 Zppix me too + the other one i had.
[22:23:18] <twentyafterfour>	 https://panic.com/prompt/  <-- use that and run gerrit via git cli
[22:23:24] <Zppix>	 twentyafterfour: at least it's not always 4:20
[22:24:04] <Zppix>	 No i like to avoid shell as much as possible
[22:24:05] <twentyafterfour>	 it's always 4:20 somewhere at least once an hour...
[22:24:23] <Zppix>	 I cant tell you how many tabs i use just so i can use shell
[22:24:25] <paladox>	 twentyafterfour lol, but when you're in the car you could write something like shutdown or misspell
[22:24:50] <Zppix>	 paladox: meh eqiad is not an important datacenter anyway :P
[22:25:14] <wikibugs_>	 (03PS1) 10BryanDavis: Python3 compat [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339065
[22:25:16] <wikibugs_>	 (03PS1) 10BryanDavis: Use IB3 library [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066
[22:25:18] <wikibugs_>	 (03PS1) 10BryanDavis: Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067
[22:25:22] <twentyafterfour>	 sudo + touch screen keyboard = disaster :D
[22:26:25] <paladox>	 twentyafterfour i paid £20 for https://itunes.apple.com/gb/app/alwaysonpc-firefox-flash-player/id324618793?mt=8
[22:26:29] <paladox>	 Zppix lol yes it is
[22:26:37] <paladox>	 wikipedia will crash if eqiad goes down.
[22:26:40] <Zppix>	 Wait so rm -f * in the puppetmaster for eqiad isnt what you wanted twentyafterfour
[22:26:46] <Zppix>	 paladox: ik i was joking
[22:27:11] <paladox>	 lol
[22:27:14] <twentyafterfour>	 paladox: you got ripped off, it's £8.99 now
[22:27:25] <paladox>	 Yeh i know, and i don't even use the thing
[22:27:30] <paladox>	 anymore
[22:27:31] <Zppix>	 I just use safari
[22:27:37] <paladox>	 i only get 2gb of storage.
[22:27:39] * Zppix puts on sunglasses
[22:27:40] <twentyafterfour>	 paladox: we can survive eqiad going away but it won't be instant recovery
[22:27:45] <paladox>	 oh
[22:28:03] <Zppix>	 Doesnt the texas datacenter have the backup of eqiad?
[22:28:22] <twentyafterfour>	 Zppix: yes, everything is duplicated in codfw
[22:28:26] <paladox>	 twentyafterfour it will probably be £20 again in 2 years anyway.
[22:28:33] <twentyafterfour>	 but I think it would take an hour or two to recover
[22:28:39] <paladox>	 yeh
[22:29:15] <wikibugs_>	 06Operations, 10ops-eqiad: decom carbon - https://phabricator.wikimedia.org/T158020#3044859 (10Dzahn)
[22:29:34] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] nova: run fullstack test suite on current labnet [puppet] - 10https://gerrit.wikimedia.org/r/339064 (owner: 10Rush)
[22:30:04] <jouncebot>	 MaxSem and Pchelolo: Respected human, time to deploy Kartotherian update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T2230). Please do the needful.
[22:30:12] <Zppix>	 twentyafterfour: you would probably need more than this operations team to recover it
[22:31:07] <wikibugs_>	 (03PS1) 10Tim Landscheidt: ganglia: Remove now-duplicate parser function suffix() [puppet] - 10https://gerrit.wikimedia.org/r/339069
[22:31:32] <icinga-wm>	 RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[22:31:57] <wikibugs_>	 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3044883 (10Dzahn)
[22:32:46] <wikibugs_>	 (03CR) 10Hashar: [C: 032] Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067 (owner: 10BryanDavis)
[22:32:47] <wikibugs_>	 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#2094545 (10Dzahn)
[22:32:55] <wikibugs_>	 (03CR) 10Tim Landscheidt: "Diff: diff -u <(git show production:modules/ganglia/lib/puppet/parser/functions/suffix.rb) modules/stdlib/lib/puppet/parser/functions/suff" [puppet] - 10https://gerrit.wikimedia.org/r/339069 (owner: 10Tim Landscheidt)
[22:33:38] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Python3 compat [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339065 (owner: 10BryanDavis)
[22:33:49] <wikibugs_>	 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#2095303 (10Dzahn)
[22:33:54] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Use IB3 library [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066 (owner: 10BryanDavis)
[22:34:07] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067 (owner: 10BryanDavis)
[22:34:25] <paladox>	 twentyafterfour I see now that polymer is in gwt. I do not get why upstream will not do that. Polygerrit is all js, whereas gwt was implemented in java.
[22:35:07] <wikibugs_>	 Operations, Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3044890 (Dzahn) @Zppix This ticket is for precise in production (i adjusted ticket title to clarify). For precise in labs please use T143349.
[22:36:01] <wikibugs_>	 Operations, Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3044896 (Dzahn)
[22:37:02] <wikibugs_>	 (CR) Tim Landscheidt: "(In fact some of the usages of $kafka_config['brokers']['array'] might now be replaceable by suffix($kafka_config['brokers'], '') because " [puppet] - https://gerrit.wikimedia.org/r/339069 (owner: Tim Landscheidt)
[22:37:10] <wikibugs_>	 Operations, Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3044897 (Zppix) @Dzahn Ah once i added the subtask for here  i then started second guessing that, thanks for confirming my doubt.
[22:37:30] <wikibugs_>	 (PS2) BryanDavis: Python3 compat [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339065
[22:37:32] <wikibugs_>	 (PS2) BryanDavis: Use IB3 library [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339066
[22:37:34] <wikibugs_>	 (PS2) BryanDavis: Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339067
[22:38:09] <wikibugs_>	 (PS5) Zppix: Tools: Make tools-clush-generator project-agnostic [puppet] - https://gerrit.wikimedia.org/r/326892 (owner: Tim Landscheidt)
[22:39:19] <wikibugs_>	 (CR) Hashar: "Hints about py3 support from pywikibot/core :)" (2 comments) [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339065 (owner: BryanDavis)
[22:39:26] <wikibugs_>	 (Abandoned) BryanDavis: Ignore lighttpd-precise in service.manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/335569 (https://phabricator.wikimedia.org/T94792) (owner: BryanDavis)
[22:39:52] <hashar>	 bd808: hi! supporting various python versions is a bit of a mess. Luckily the pywikibot people figured it out!
[22:40:00] <hashar>	 bd808: some tips and tricks at https://gerrit.wikimedia.org/r/#/c/339065/1/tox.ini :}
[22:40:28] <hashar>	 bd808: you can also use just "py3" and tox will run whatever version "python3" is
[22:41:51] <wikibugs_>	 (PS2) Tim Landscheidt: Tools: Outfactor jobkill script to toollabs::node::all [puppet] - https://gerrit.wikimedia.org/r/335755
[22:42:03] <wikibugs_>	 (CR) Hashar: [C: 1] "Looks good. See my note about flake8 being run with python2.  Might want to add another env running flake8 with python3.  That is what pyw" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339065 (owner: BryanDavis)
[22:43:18] <wikibugs_>	 (PS3) Dzahn: Add mgmt and production DNS for ms-be2028-ms-be2039 [dns] - https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) (owner: Papaul)
[22:43:48] <wikibugs_>	 (CR) Dzahn: [V: 1 C: 1] Add mgmt and production DNS for ms-be2028-ms-be2039 [dns] - https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) (owner: Papaul)
[22:43:51] <wikibugs_>	 (CR) Dzahn: [V: 1 C: 2] Add mgmt and production DNS for ms-be2028-ms-be2039 [dns] - https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) (owner: Papaul)
[22:43:57] <wikibugs_>	 (CR) Chad: clean.py: Fix up l10nupdate-owned files on masters (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/339035 (owner: Chad)
[22:47:13] <wikibugs_>	 Operations, Domains, Traffic, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044907 (Zppix) With the information @CRoslof  provided I'm going to consider this task denied? Anyone disagree?
[22:48:28] <wikibugs_>	 Operations, ops-eqiad: decom carbon - https://phabricator.wikimedia.org/T158020#3044908 (RobH)
[22:48:44] <wikibugs_>	 (PS1) BryanDavis: Run flake8 on both python2 and python3 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339076
[22:49:13] <wikibugs_>	 (CR) BryanDavis: [C: 2] "Flake8 on py3 followup in Ie8abe9934b2afe59333a238b1d01a38f118d6e93" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339065 (owner: BryanDavis)
[22:49:58] <wikibugs_>	 Operations, ops-eqiad: decom carbon - https://phabricator.wikimedia.org/T158020#3023767 (RobH) a:RobH>Dzahn Assigning back to Daniel pending his approval to wipe the disks.  (I imagine this approval will follow once we have a waiting period and no one realizes they missed anything.)  When this is...
[22:50:33] <wikibugs_>	 (Merged) jenkins-bot: Python3 compat [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339065 (owner: BryanDavis)
[22:50:52] <bd808>	 jouncebot: next
[22:50:52] <jouncebot>	 In 1 hour(s) and 9 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T0000)
[22:52:21] <icinga-wm>	 PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:53:53] <wikibugs_>	 Operations, ops-eqiad: decom carbon - https://phabricator.wikimedia.org/T158020#3044912 (RobH) a:Dzahn>Cmjohnson So this is now ready for wipe.  After discussion with Daniel, we have multiple backup options of this data, but we'll put a last chance date of March 1st.  Chris: Please do not wipe th...
[22:54:10] <wikibugs_>	 Operations, ops-eqiad, hardware-requests: decom carbon - https://phabricator.wikimedia.org/T158020#3044915 (RobH)
[22:55:05] <wikibugs_>	 Operations, Domains, Traffic, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044918 (Matthewrbowker) >>! In T158490#3042854, @Zppix wrote: >>>! In T158490#3039567, @Matthewrbowker wrote: >>>>! In T158490#3039326, @Z...
[22:55:13] <wikibugs_>	 (CR) Hashar: [C: 2] "I can tell tell you are smarter than me :}" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339076 (owner: BryanDavis)
[22:56:04] <hashar>	 bd808: the IB3 library change, it is too late for me to properly review it. My understanding is you created a new lib that extracts the useful bits from jouncebot. That is great
[22:56:33] <wikibugs_>	 Operations, Domains, Traffic, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044919 (MaxSem) Open>declined
[22:56:42] <bd808>	 hashar: I'm going to test it live (old school) as soon as the venv update finishes
[22:56:54] <bd808>	 I switched stashbot to it earlier today with no issues
[22:57:40] <wikibugs_>	 (CR) BryanDavis: "Testing via cherry-pick to tool" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339066 (owner: BryanDavis)
[22:58:32] <Zppix>	 paladox: about that panel issue, maybe filing a ticket could help
[22:58:42] <paladox>	 Zppix it's fixed now
[22:58:43] <wikibugs_>	 (CR) Thcipriani: [C: 1] clean.py: Remove useless underscore from method name [mediawiki-config] - https://gerrit.wikimedia.org/r/339029 (owner: Chad)
[22:58:49] <paladox>	 twentyafterfour fixed it :)
[22:58:55] <Zppix>	 Oh what was the issue paladox
[22:59:39] <paladox>	 [22:12:23]  <twentyafterfour>	I can't remove it
[22:59:40] <paladox>	 [22:12:27]  <twentyafterfour>	it's not a real panel
[22:59:40] <paladox>	 [22:12:28]  <twentyafterfour>	it's a glitch
[22:59:44] <paladox>	 Zppix ^^
[22:59:54] <Zppix>	 Ok but what caused it :D
[23:00:00] <hashar>	 bd808: yeah that is all super smart. Kudos!
[23:00:07] <paladox>	 Zppix not sure.
[23:00:10] <wikibugs_>	 (CR) Hashar: [C: 1] Use IB3 library [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339066 (owner: BryanDavis)
[23:00:36] <hashar>	 bd808: don't wait for my reviews :}  I am sleeping now!
[23:00:39] <wikibugs_>	 (CR) Zppix: [C: 1] clean.py: Remove useless underscore from method name [mediawiki-config] - https://gerrit.wikimedia.org/r/339029 (owner: Chad)
[23:00:50] <wikibugs_>	 (CR) Thcipriani: [C: 1] "nitpick inline" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/339032 (owner: Chad)
[23:02:06] <wikibugs_>	 (CR) Thcipriani: "comment inline" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) (owner: Chad)
[23:04:01] <icinga-wm>	 PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:04:12] <icinga-wm>	 PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[23:06:28] <chasemp>	 andrewbogott: ^?
[23:07:10] <wikibugs_>	 (CR) Zppix: [C: 1] Run flake8 on both python2 and python3 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/339076 (owner: BryanDavis)
[23:07:15] <wikibugs_>	 Operations, ops-codfw, netops: codfw:ms-be2028-ms-be2039 switch port configuration - https://phabricator.wikimedia.org/T158714#3044966 (Papaul)
[23:07:17] <andrewbogott>	 chasemp: no idea, I'll look
[23:09:25] <wikibugs_>	 (CR) Thcipriani: [C: -1] clean.py: Fix up l10nupdate-owned files on masters (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/339035 (owner: Chad)
[23:11:44] <wikibugs_>	 (CR) Dzahn: "you are correct and i also like the consistency better, thanks" [puppet] - https://gerrit.wikimedia.org/r/338610 (owner: Tim Landscheidt)
[23:11:55] <wikibugs_>	 (CR) Dzahn: [C: 2] toolserver_legacy: Use Redirect instead of RedirectMatch [puppet] - https://gerrit.wikimedia.org/r/338610 (owner: Tim Landscheidt)
[23:12:04] <wikibugs_>	 (PS2) Dzahn: toolserver_legacy: Use Redirect instead of RedirectMatch [puppet] - https://gerrit.wikimedia.org/r/338610 (owner: Tim Landscheidt)
[23:14:09] <wikibugs_>	 Operations, Domains, Traffic, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044987 (Dzahn) Having multiple URLs for the same content is also bad for "SEO" and we already have w.wiki as a generic URL shortener.
[23:15:11] <icinga-wm>	 RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[23:15:57] <wikibugs_>	 (CR) Chad: [C: 2] clean.py: Remove useless underscore from method name [mediawiki-config] - https://gerrit.wikimedia.org/r/339029 (owner: Chad)
[23:17:33] <wikibugs_>	 (Merged) jenkins-bot: clean.py: Remove useless underscore from method name [mediawiki-config] - https://gerrit.wikimedia.org/r/339029 (owner: Chad)
[23:17:42] <wikibugs_>	 (CR) jenkins-bot: clean.py: Remove useless underscore from method name [mediawiki-config] - https://gerrit.wikimedia.org/r/339029 (owner: Chad)
[23:19:29] <wikibugs_>	 (PS3) Chad: clean.py: Rework command execution, reduce code dupe [mediawiki-config] - https://gerrit.wikimedia.org/r/339032
[23:19:39] <RainbowSprinkles>	 thcipriani: Renamed do_stuff() to execute_remote()
[23:19:40] <RainbowSprinkles>	 :p
[23:19:51] <Zppix>	 But do stuff is better RainbowSprinkles
[23:20:08] <RainbowSprinkles>	 do_stuff()
[23:20:12] <RainbowSprinkles>	 really_do_stuff()
[23:20:16] <RainbowSprinkles>	 do_stuff_2()
[23:20:17] <RainbowSprinkles>	 :)
[23:20:29] <Zppix>	 Insert_the_code_automatically()
[23:20:51] <Zppix>	 ^ thats a lifesaver
[23:21:22] <icinga-wm>	 RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[23:21:34] <logmsgbot>	 !log demon@tin Started scap: scap/plugins/clean.py Code cleanup
[23:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:45] <logmsgbot>	 !log demon@tin scap aborted: scap/plugins/clean.py Code cleanup (duration: 00m 10s)
[23:21:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:56] <thcipriani>	 RainbowSprinkles: thanks :)
[23:22:00] <RainbowSprinkles>	 Whoops didn't mean a full scap
[23:22:20] <Zppix>	 It would help :P
[23:22:57] <logmsgbot>	 !log demon@tin Synchronized scap/plugins/clean.py: Code cleanup (duration: 00m 46s)
[23:23:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:17] <wikibugs_>	 (PS3) Gergő Tisza: Send 'exception' channel to logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849)
[23:30:06] <MaxSem>	 !log Kartotherian deploy did not happen
[23:30:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:01] <icinga-wm>	 RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[23:34:57] <wikibugs_>	 Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3045037 (GWicke)
[23:43:15] <bd808>	 jouncebot: next
[23:43:15] <jouncebot>	 In 0 hour(s) and 16 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T0000)
[23:43:19] <wikibugs_>	 Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3045070 (GWicke) >>! In T66214#2981357, @Gilles wrote: > Something that's missing in the current plan, however, is the swift sharding information tha...
[23:43:39] <bd808>	 that was a "fun" bug in jouncebot :/
[23:43:44] <Zppix>	 Good job bd808 talk about last min
[23:44:21] <bd808>	 it's running from my laptop at the moment. I'll get things in gerrit after SWAT is done
[23:44:30] <Zppix>	 Oh gh
[23:44:37] <Zppix>	 Is it that bad :P
[23:49:07] <wikibugs_>	 (PS3) Rush: nova: run fullstack test suite on current labnet [puppet] - https://gerrit.wikimedia.org/r/339064
[23:50:39] <Zppix>	 jouncebot: next
[23:50:39] <jouncebot>	 In 0 hour(s) and 9 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T0000)
[23:54:37] <wikibugs_>	 Operations, ops-codfw: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3045079 (RobH)
[23:54:39] <wikibugs_>	 Operations, ops-codfw, netops: codfw:ms-be2028-ms-be2039 switch port configuration - https://phabricator.wikimedia.org/T158714#3045077 (RobH) Open>Resolved all ports have been enabled, had descriptions set, and placed in the private vlan for their respective rows.
[23:54:51] <wikibugs_>	 Operations, ops-codfw: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3033850 (RobH)
[23:54:54] <wikibugs_>	 (PS2) Andrew Bogott: WIP:  Sync ldap project groups with keystone project membership [puppet] - https://gerrit.wikimedia.org/r/338918
[23:55:00] <RainbowSprinkles>	 Warning: Cannot modify header information - headers already sent in /srv/mediawiki/php-1.29.0-wmf.12/includes/GlobalFunctions.php on line 1791	1
[23:55:00] <RainbowSprinkles>	 Warning: Cannot modify header information - headers already sent in /srv/mediawiki/php-1.29.0-wmf.12/includes/libs/HttpStatus.php on line 111
[23:55:08] <RainbowSprinkles>	 thcipriani: I should move wfGetCaller() one level further up
[23:55:38] <RainbowSprinkles>	 Also, recording the header() we're sending is probably a good idea
[23:56:24] <wikibugs_>	 (CR) 20after4: [C: 1] clean.py: Rework command execution, reduce code dupe [mediawiki-config] - https://gerrit.wikimedia.org/r/339032 (owner: Chad)
[23:56:31] <thcipriani>	 haven't dug into the logs on that one, not seeing anything good?
[23:56:45] <RainbowSprinkles>	 Nothing useful yet
[23:56:54] <RainbowSprinkles>	 The "already sent in" bit is just a wrapper for the caller itself
[23:57:07] <RainbowSprinkles>	 So should get the next highest caller
[23:58:02] <RainbowSprinkles>	 Eh, those errors aren't the same cuz they're not using WebResponse::header()
[23:58:08] <RainbowSprinkles>	 This is going to be cat and mouse :(
[23:59:00] <thcipriani>	 yeah, gonna be tricky to track down :\
[23:59:03] <tgr>	 RainbowSprinkles: Tim has a pending patch to do that, if you can wait with the debugging until next week
[23:59:11] <tgr>	 or want to merge & backport
[23:59:20] <RainbowSprinkles>	 Link? We can backport easy
[23:59:50] <tgr>	 https://gerrit.wikimedia.org/r/#/c/338705/
[23:59:57] <wikibugs_>	 (CR) Dzahn: [V: 1 C: 1] "http://puppet-compiler.wmflabs.org/5520/" [puppet] - https://gerrit.wikimedia.org/r/338302 (owner: 20after4)