[00:05:35] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:07:35] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [00:09:51] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: Vagrant 1.8.7 fails to fetch Jessie image with vague error message - https://phabricator.wikimedia.org/T158608#3041762 (10brion) [00:19:35] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:22:18] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: Vagrant 1.8.7 fails to fetch Jessie image with vague error message - https://phabricator.wikimedia.org/T158608#3041776 (10bd808) I don't see anything at https://atlas.hashicorp.com/debian/boxes/contrib-jessie64 that explicitly says a re... [00:26:35] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [00:29:35] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [00:29:45] RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 725 bytes in 0.083 second response time [00:33:25] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:33:45] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [00:34:35] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [00:45:35] (03Abandoned) 10Krinkle: Remove unused top6-wikipedia.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334463 (owner: 10Krinkle) [00:47:35] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [00:51:47] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3041805 (10brion) [00:51:51] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: Vagrant 1.8.7 fails to fetch Jessie image with vague error message - https://phabricator.wikimedia.org/T158608#3041803 (10brion) 05Open>03Invalid Running with --debug seems to indicate that Vagrant's downloader is failing to load cu... [00:54:35] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:55:49] (03PS3) 10Tim Starling: Route PHP warnings from the handler into udp2log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) [00:56:45] (03CR) 10Tim Starling: "Bryan: Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: 10Tim Starling) [01:01:25] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [01:04:45] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [01:07:21] (03CR) 10Tim Starling: [C: 032] " TimStarling: yeah. give it a shot and lets see how fast the hard drive fills up :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: 10Tim Starling) [01:08:59] (03Merged) 10jenkins-bot: Route PHP warnings from the handler into udp2log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: 10Tim Starling) [01:09:07] (03CR) 10jenkins-bot: Route PHP warnings from the handler into udp2log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: 10Tim Starling) [01:14:35] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [01:17:31] !log tstarling@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 42s) [01:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:35] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [01:18:59] No justification provided, snap. 
[01:23:35] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [01:33:45] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [01:55:35] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [01:58:35] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [01:58:40] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: Job runner service doesn't appear to work in jessie-migration - https://phabricator.wikimedia.org/T158615#3041900 (10brion) [02:03:30] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: Job runner service doesn't appear to work in jessie-migration - https://phabricator.wikimedia.org/T158615#3041913 (10brion) Note there is no `logs/mediawiki-runJobs.log` file, and I cannot connect to port 80 on 127.0.0.1 from within the... [02:09:35] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [02:12:06] (03PS1) 10Andrew Bogott: WIP: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 [02:12:35] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [02:19:17] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.12) (duration: 07m 20s) [02:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:25] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:24:37] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Feb 21 02:24:37 UTC 2017 (duration 5m 20s) [02:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:10] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: Job runner service doesn't appear to work in jessie-migration - https://phabricator.wikimedia.org/T158615#3041919 (10bd808) Likely broken by {rMWVA1956f986abfe} where we dropped the port 80 bind. [02:33:55] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.500 second response time [02:38:55] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.382 second response time [02:48:25] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [03:10:55] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:24:25] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 646.40 seconds [03:27:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 230.32 seconds [03:30:35] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:37:55] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [03:50:25] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:52:45] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3041958 (10brion) [03:58:35] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [04:09:25] PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:18:25] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [04:33:50] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team: npm install fails for changeprop service in MW-Vagrant jessie-migration - https://phabricator.wikimedia.org/T158617#3041968 (10brion) [04:38:25] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [04:38:55] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.232 second response time [04:48:56] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.810 second response time [05:06:35] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:15:45] (03PS1) 10Krinkle: webperf: Remove unused deprecate.py [puppet] - 10https://gerrit.wikimedia.org/r/338929 [05:16:35] (03CR) 10Krinkle: "The last value in this schema was received in March 2016 per https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=Dep" [puppet] - 10https://gerrit.wikimedia.org/r/338929 (owner: 10Krinkle) [05:17:15] (03PS2) 10Krinkle: webperf: Remove unused deprecate.py [puppet] - 10https://gerrit.wikimedia.org/r/338929 [05:17:57] (03CR) 10Krinkle: "@ops: What's the convention for ensuring this service is stopped? Or can we just do it manually?" [puppet] - 10https://gerrit.wikimedia.org/r/338929 (owner: 10Krinkle) [05:34:35] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [05:45:45] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:46:35] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [05:46:55] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [06:02:08] is gerrit really slow or just for me? and git pull --rebase on the puppet repo has been failing for me for the last 2-3 hours. [06:13:45] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:38:56] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.308 second response time [06:44:05] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.077 second response time [06:50:25] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:12:19] (03PS1) 10Marostegui: db-codfw.php: Repool db2048, depool db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338934 (https://phabricator.wikimedia.org/T132416) [07:13:55] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.486 second response time [07:18:55] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.597 second response time [07:19:25] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [07:26:07] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2048, depool db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338934 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:27:36] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2048, depool db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338934 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:27:57] (03CR) 10jenkins-bot: db-codfw.php: Repool db2048, depool db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338934 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:29:05] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2048 and depool db2055 - T132416 (duration: 00m 51s) [07:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:11] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [07:31:37] !log Deploy alter table enwiki.revision db2055 - T132416 [07:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:55] (03CR) 10Muehlenhoff: "You could use" [puppet] - 
10https://gerrit.wikimedia.org/r/338929 (owner: 10Krinkle) [08:06:35] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:16:15] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM but see unrelated change which slipped in?" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) (owner: 10Papaul) [08:22:21] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic10(27|32|37|41) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/338811 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [08:24:06] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(27|32|37|41).eqiad.wmnet [08:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:25] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:27:19] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042192 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1027.eqiad.wmnet'] ``` The... [08:27:41] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042193 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1032.eqiad.wmnet'] ``` The... [08:27:44] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042194 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1041.eqiad.wmnet'] ``` The... 
[08:27:59] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042195 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1037.eqiad.wmnet'] ``` The... [08:28:55] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.480 second response time [08:30:24] !log increasing concurrent recoveries / relocations to 8 on elasticsearch eqiad [08:30:25] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [08:30:27] (03CR) 10Gilles: [C: 031] webperf: Remove unused deprecate.py [puppet] - 10https://gerrit.wikimedia.org/r/338929 (owner: 10Krinkle) [08:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:19] !log upgrading mw1170-mw1208 to HHVM 3.12.14 [08:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:35] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [08:34:43] 06Operations: Manage apt sources via puppet? - https://phabricator.wikimedia.org/T158562#3040563 (10fgiunchedi) +1, sounds good to me! 
[08:38:55] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.455 second response time [08:41:26] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [08:42:48] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#3042204 (10jcrespo) a:05akosiaris>03None [08:43:04] 06Operations, 10hardware-requests: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3042206 (10jcrespo) [08:43:06] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2917240 (10jcrespo) [08:44:27] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042210 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1027.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1027.eqi... [08:44:29] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2917240 (10jcrespo) I arrived here through cronspam, added T156506 as a subtask (meaning it depends on, it is obviously not a subtask) so others do not lose time next time. [08:48:58] (03PS1) 10Filippo Giunchedi: Revert "graphite: switch to graphite2001" [dns] - 10https://gerrit.wikimedia.org/r/338938 (https://phabricator.wikimedia.org/T157022) [08:52:14] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042218 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1041.eqiad.wmnet'] ``` and were **ALL** successful. 
[08:52:17] (03CR) 10Filippo Giunchedi: [C: 032] Revert "graphite: switch to graphite2001" [dns] - 10https://gerrit.wikimedia.org/r/338938 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [08:52:25] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [08:52:34] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042219 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1037.eqiad.wmnet'] ``` and were **ALL** successful. [08:53:27] !log switch statsd/graphite DNS to graphite1001 - T157022 [08:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:33] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [09:01:35] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [09:07:05] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042240 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1032.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1032.eqi... [09:13:19] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042241 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1027.eqiad.wmnet'] ``` The... 
[09:13:21] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042242 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1032.eqiad.wmnet'] ``` The... [09:16:37] godog: need to restart all the jmxtrans on analytics to pick up graphite1001? [09:18:30] elukey: heh I think so, let's wait another 45m or so for the ttl to be fully expired and see what still goes to graphite2001 [09:18:50] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042244 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1032.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1032.eqi... [09:19:39] godog: sure! Let me know if I have to run a round of restarts to help [09:20:52] PROBLEM - Elasticsearch HTTPS on elastic1032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:21:16] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042245 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1032.eqiad.wmnet'] ``` The... [09:21:18] ^elastic1032 is me - reimage in progress... [09:26:00] !log cp3030: libssl1.1 upgraded to 1.1.0e-1+wmf1, libevent-2.0-5 upgraded to 2.0.21-stable-2+deb8u1 [09:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:46] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [09:35:36] PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:35:49] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042278 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1027.eqiad.wmnet'] ``` and were **ALL** successful. [09:36:16] PROBLEM - HHVM jobrunner on mw1161 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:36:44] checking --^ [09:36:50] it's me, upgrading [09:36:59] it's depooled [09:37:05] \o/ [09:37:09] thanks :) [09:37:16] RECOVERY - HHVM jobrunner on mw1161 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [09:37:43] moritzm: does conf-tool have any effect on jobrunners? [09:38:16] PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:38:18] (except DSH) [09:39:16] RECOVERY - HHVM jobrunner on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [09:39:36] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[hhvm-dbg] [09:39:37] (03PS2) 10Hashar: nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 [09:41:20] (03CR) 10Elukey: [C: 031] "Weird that ${$mirror_name} didn't trigger any issue, it must evaluate as ${mirror_name} (I checked on kafka1001 and the substitution was r" [puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [09:42:56] PROBLEM - HHVM jobrunner on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:43:56] RECOVERY - HHVM jobrunner on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [09:44:26] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3042303 (10Addshore) [09:46:30] !log upgrade graphite on graphite1001 and bounce carbon daemons [09:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:54] I have some funny oddity with graphite :-} [09:50:41] the .upper metric looks off using the week bucket bah [09:51:09] godog: you have been upgrading Graphite haven't you ? [09:51:13] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#2776612 (10aude) @Lydia_Pintscher only thing that changes is that interwiki links are sorted in all names... [09:51:24] or Carbon or statsd maybe? 
[09:52:06] hashar: the former yeah [09:53:06] the .upper is behaving strangely with the last 7 days of data [09:53:11] but the long aggregation looks fine [09:53:12] https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=13&fullscreen&from=now-14d&to=now :} [09:53:29] should be around 2 minutes [09:53:52] but for the last 7 days of data (which I assume is the bucket of 7 days that keeps the per-minute data points) it does not seem to keep the max [09:54:06] will file a bug / try to reproduce with the graphite web interface [09:54:54] hashar: yeah please file a bug for that! [09:55:28] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3042332 (10Addshore) Indeed! [09:55:44] median is not affected :} [09:57:32] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: wdqs - move metric collections to diamond - https://phabricator.wikimedia.org/T146468#3042338 (10Addshore) 05Open>03Resolved a:03Addshore [09:58:01] !log upgrading mira/tin to HHVM 3.12.14 [09:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:36] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:03:36] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:08:11] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042390 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1032.eqiad.wmnet'] ``` and were **ALL** successful.
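Editor's note on the `.upper` discussion above: a minimal sketch, not taken from the log, of why an `.upper` series depends on which function folds raw samples into one data point per bucket. The helper name `bucketize`, the 60-second bucket width, and the sample timings are all hypothetical, and nothing here is confirmed as the cause of the oddity hashar describes. It only shows that a max-style series and an averaged series diverge as soon as a bucket holds more than one sample.

```python
from collections import defaultdict
from statistics import mean


def bucketize(samples, bucket_seconds, aggregate):
    """Group (timestamp, value) samples into fixed-width buckets and
    reduce each bucket to a single value with `aggregate`."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of its bucket.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return {start: aggregate(values) for start, values in sorted(buckets.items())}


# Hypothetical timings in seconds: (timestamp, duration).
samples = [(0, 30.0), (20, 120.0), (40, 45.0), (60, 50.0), (90, 70.0)]

upper = bucketize(samples, 60, max)      # keeps the largest sample per bucket
averaged = bucketize(samples, 60, mean)  # what an averaging step would store instead

print(upper)     # {0: 120.0, 60: 70.0}
print(averaged)  # {0: 65.0, 60: 60.0}
```

In this toy data the first bucket's peak of 120 seconds (the "around 2 minutes" order of magnitude mentioned above) shrinks to 65 once the samples are averaged, which is the kind of difference that would make an `.upper` graph look too low.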
[10:08:36] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:10:57] (03PS1) 10Jcrespo: Change hosts list to use tab-separated format, update them [software] - 10https://gerrit.wikimedia.org/r/338941 [10:11:29] godog: filed https://phabricator.wikimedia.org/T158633 I can live without .upper .lower for a while. The .median is apparently not affected :} [10:12:15] hashar: ok thanks! [10:12:37] I haven't tried on other time-based metrics [10:13:16] RECOVERY - Elasticsearch HTTPS on elastic1032 is OK: SSL OK - Certificate elastic1032.eqiad.wmnet valid until 2022-02-20 10:11:41 +0000 (expires in 1824 days) [10:13:31] (03CR) 10Marostegui: [C: 031] Change hosts list to use tab-separated format, update them [software] - 10https://gerrit.wikimedia.org/r/338941 (owner: 10Jcrespo) [10:14:16] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 25 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm],Package[hhvm-dbg] [10:14:52] !log downgrade carbon-c-relay on graphite1001 to trusty's version and bounce daemons [10:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:13] (03PS1) 10Jcrespo: Update s5-pager to the latest version (including extra indexes) [software] - 10https://gerrit.wikimedia.org/r/338943 (https://phabricator.wikimedia.org/T147747) [10:19:16] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [10:22:13] (03PS2) 10Volans: Improvements in the metadata and package setup [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [10:24:50] (03CR) 10Jcrespo: [C: 032] Change hosts list to use tab-separated format, update them [software] - 10https://gerrit.wikimedia.org/r/338941 (owner: 10Jcrespo) [10:27:36] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 47
seconds ago with 0 failures [10:29:53] !log restarting base services on mw2* after openssl update [10:29:56] elukey: yeah if you could start a rolling restart of jmxtrans that'd be nice! make sure to mention T157022 in !log so we can keep track of it [10:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:59] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [10:30:03] I have to go in ~15 min [10:31:06] yessir! [10:55:50] !log rolling restart of the analytics jmxtrans daemons for T157022 [10:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:55] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [10:57:28] (03PS1) 10Subramanya Sastry: Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 [10:59:25] (03PS3) 10Volans: Improvements in the metadata and package setup [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [11:02:14] !log rolling restart of cassandra-metrics-collector on aqs1* for T157022 [11:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:20] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [11:03:01] godog: I've done the whole Hadoop cluster, Druid and AQS nodes.. IIRC it should be enough, but let me know if I am missing something [11:05:12] !log upgrading openssl on hadoop cluster / various base service restarts [11:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:16] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [11:15:23] !log upgrading openssl on restbase clusters / various base service restarts [11:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:36] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:18:56] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.235 second response time [11:19:44] second coffee in two hours and still feeling asleep :S [11:26:29] (03PS1) 10Jdrewniak: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546) [11:27:20] tabbycat: then you should have a half-hour sleep session :} [11:27:50] tabbycat: or try to get outside for a while for some fresh air. That might wake you up ?
:) [11:29:00] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.876 second response time [11:29:51] (03PS22) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:29:59] 06Operations, 10Domains, 10Traffic: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (10Kaarel_Vaidla) [11:32:52] !log upgrading openssl on kafka clusters / various base service restarts [11:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:44] (03PS2) 10Elukey: Move codfw appserver conftool-data to codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/338108 (https://phabricator.wikimedia.org/T156023) [11:34:20] (03PS23) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:36:07] (03CR) 10Elukey: [C: 032] Move codfw appserver conftool-data to codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/338108 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [11:37:25] (03PS1) 10Ema: cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) [11:39:10] (03CR) 10jerkins-bot: [V: 04-1] cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) (owner: 10Ema) [11:39:57] 06Operations: Manage apt sources via puppet? - https://phabricator.wikimedia.org/T158562#3040563 (10faidon) `apt::repository` has a `comment_old` option that comments out the line in sources.list, and `Apt::Repository[wikimedia]` sets that to `true`, so this should be the case already. If it's not, it's probably... 
[11:40:20] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [11:40:36] (03CR) 10Volans: "Last puppet compiler, including the require of cumin's package:" [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [11:42:21] (03PS2) 10Ema: cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) [11:42:40] PROBLEM - DPKG on poolcounter1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:43:01] ^ poolcounter1002 is harmless, update in progress [11:43:40] RECOVERY - DPKG on poolcounter1002 is OK: All packages OK [11:45:40] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:49:40] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:50:22] (03PS24) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:54:27] (03CR) 10Muehlenhoff: [C: 031] Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [11:57:19] (03CR) 10Ema: "PCC output here https://puppet-compiler.wmflabs.org/5518/" [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) (owner: 10Ema) [12:01:06] !log temporarily disabled puppet on neodymium and puppetmaster1001 to merge Gerrit 330436 T154588 [12:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:12] T154588: Automation framework first version - https://phabricator.wikimedia.org/T154588 [12:02:13] (03PS25) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 
(https://phabricator.wikimedia.org/T154588) [12:03:58] (03CR) 10Volans: [C: 032] Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:05:00] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:40] PROBLEM - Check systemd state on mw2245 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:07:50] PROBLEM - Check systemd state on mw1255 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:07:50] PROBLEM - Check systemd state on mw2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:07:50] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:07:51] PROBLEM - Check systemd state on db1085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:07:51] PROBLEM - Check systemd state on sarin is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:07:52] that's ferm, chcecking [12:07:58] argh... [12:08:00] PROBLEM - Check systemd state on elastic1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:08:10] PROBLEM - Check systemd state on mc2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:08:10] PROBLEM - Check systemd state on elastic1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:08:11] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3042597 (10kruusamagi) I removed the date info from the main page of Estonian Wikipedia, but it only helps to hide the issue and not to solve it (the weekl... [12:08:32] !log stopped ircecho temporarily while fixing ferm [12:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:19] (03PS1) 10Volans: Cumin: fix ferm service srange [puppet] - 10https://gerrit.wikimedia.org/r/338956 (https://phabricator.wikimedia.org/T154588) [12:12:19] (03CR) 10Muehlenhoff: [C: 031] Cumin: fix ferm service srange [puppet] - 10https://gerrit.wikimedia.org/r/338956 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:12:36] (03CR) 10Volans: [C: 032] Cumin: fix ferm service srange [puppet] - 10https://gerrit.wikimedia.org/r/338956 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:14:21] (03PS1) 10Gilles: Increase SWIFT_RETRIES in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/338957 (https://phabricator.wikimedia.org/T157949) [12:14:39] fix works properly [12:28:16] volans: still lots of CRITs in icinga, is that because of ferm? 
[12:28:58] ema: yes, I'm running puppet, they're recovering [12:29:08] ok [12:29:08] sorry about the mess [12:29:12] no worries [12:31:24] RECOVERY - Check systemd state on mw1225 is OK: OK - running: The system is fully operational [12:31:24] RECOVERY - Check systemd state on darmstadtium is OK: OK - running: The system is fully operational [12:31:24] RECOVERY - Check systemd state on mw1162 is OK: OK - running: The system is fully operational [12:31:24] RECOVERY - Check systemd state on mw1263 is OK: OK - running: The system is fully operational [12:31:24] RECOVERY - Check systemd state on mw1271 is OK: OK - running: The system is fully operational [12:31:25] RECOVERY - Check systemd state on argon is OK: OK - running: The system is fully operational [12:31:25] RECOVERY - Check systemd state on mw2105 is OK: OK - running: The system is fully operational [12:31:37] shut up ircecho :) [12:38:33] (03PS1) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) [12:39:38] !log re-enabled ircecho after fixing ferm issue and ran puppet on affected hosts [12:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:56] ema: all done, back to normal [12:40:09] volans: nice, thanks! [12:40:31] (03PS2) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) [12:42:08] RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys. 
[12:51:50] !log re-enabled puppet on planet2001, it had been disabled for a week without reason [12:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:51] !log re-enabled puppet on neodymium and puppetmaster1001 after Gerrit 330436 was merged T154588 [12:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:56] T154588: Automation framework first version - https://phabricator.wikimedia.org/T154588 [12:55:03] !log upgrading openssl on database servers / various base service restarts [12:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:09] (03PS1) 10Phuedx: Enable "reading depth" logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) [13:04:59] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:06:53] !log upgrading openssl on parsoid clusters / various base service restarts [13:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:59] (03CR) 10Paladox: [C: 031] "@Jcrespo hi, we can still merge this and at a later date put this into a separate repo. 
But as you pointed out your buisy so why reject th" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [13:15:15] (03PS2) 10Phuedx: Enable ReadingDepth instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) [13:20:39] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [13:21:44] !log upgrading openssl on aqs cluster / various base service restarts [13:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:05] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3042690 (10Paladox) We can still do https://gerrit.wikimedia.org/r/#/c/336002/ since I doint see an urgency to ha... [13:23:39] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [13:25:29] (03PS3) 10Phuedx: Enable ReadingDepth logging on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) [13:26:23] (03CR) 10Bmansurov: [C: 031] Enable ReadingDepth logging on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) (owner: 10Phuedx) [13:31:59] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:37:09] PROBLEM - Keyholder SSH agent on sarin is CRITICAL: CRITICAL: Cannot connect to keyholder-proxy socket /run/keyholder/proxy.sock. [13:41:05] !log restarting nodejs on aqs1* to pick up openssl security upgrades [13:41:09] RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys. 
[13:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:59] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 3 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#2947598 (10faidon) >>! In T155578#3037355, @EBernhardson wrote: > I poked paravoid about if we had any better solutions this time aroun... [13:42:38] (03CR) 10Hashar: Support Jenkins install from 'experimental' component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) (owner: 10Hashar) [13:45:09] PROBLEM - Keyholder SSH agent on sarin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [13:45:16] (03PS3) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) [13:46:09] RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys. [13:46:53] (03PS6) 10Hashar: jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/337385 [13:46:55] (03PS9) 10Hashar: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 [13:46:57] (03PS3) 10Hashar: jenkins: add basic specs [puppet] - 10https://gerrit.wikimedia.org/r/337836 [13:48:24] (03PS4) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) [13:49:39] keyholder on sarin it's me [13:57:33] * kart_ waves for SWAT [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1400). 
[14:00:04] kart_, addshore, jan_drewniak, and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:14] o/ [14:00:14] o/ [14:00:19] \0 [14:00:21] o/ [14:00:30] kart_: doing your [14:00:31] o/ [14:00:36] cool. [14:00:50] hashar: looks like you are in charge, ping me if you need help :) [14:00:50] !log upgrading openssl on maps clusters / various base service restarts [14:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:15] (03PS1) 10Jcrespo: prometheus-mysql-exporter: Add labsdb1005, just upgraded from precise [puppet] - 10https://gerrit.wikimedia.org/r/338970 [14:01:34] kart_: your patch is now on mwdebug1001 [14:01:57] looking at phuedx one [14:02:10] hashar: sure. Testing. Give me few minutes. [14:02:27] our configuration files are insane [14:03:12] hashar: ^ [14:03:19] 👍 [14:03:23] (03CR) 10Jcrespo: [C: 04-1] "Filippo: let's talk, we probably need to do some changes to prometheus configuration for mysql-exporter connection, both the socket and in" [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo) [14:03:57] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) (owner: 10Phuedx) [14:04:49] (03PS2) 10Hashar: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:04:58] jan_drewniak: and I rebased your portal patch :} [14:05:19] (03Merged) 10jenkins-bot: Enable ReadingDepth logging on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) (owner: 10Phuedx) [14:06:14] phuedx: ReadingDepth logging is enabled on mwdebug1001 [14:06:25] phuedx: then I guess there is no good way to test it is there? 
[14:06:56] hashar: there's no clean way to test it -- i can see what values are getting forwarded to the client [14:07:02] jan_drewniak: do you want to deploy the change yourself or should I ? [14:07:14] phuedx: guess I can just push it cluster wide [14:07:21] (03CR) 10jenkins-bot: Enable ReadingDepth logging on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338966 (https://phabricator.wikimedia.org/T155639) (owner: 10Phuedx) [14:07:37] hashar: could you? thanks [14:07:43] hashar: go ahead. [14:07:44] (03CR) 10Hashar: [C: 032] "It is SWAT time :}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:07:49] jan_drewniak: I will [14:08:00] kart_: thanks for the test :}  Pushing it to prod [14:08:21] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable ReadingDepth logging on Wikipedias - T148262 T155639 (duration: 00m 45s) [14:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:26] T148262: Vet and explore new readership engagement metric - https://phabricator.wikimedia.org/T148262 [14:08:26] T155639: Create reading depth schema - https://phabricator.wikimedia.org/T155639 [14:08:51] (03Merged) 10jenkins-bot: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:08:59] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:09:01] 06Operations, 10netops: cr2-knams<->asw-esams GBLX fiber down - https://phabricator.wikimedia.org/T158647#3042814 (10faidon) [14:09:43] (03CR) 10jenkins-bot: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338951 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:09:49] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:09:58] (03CR) 10Jcrespo: [C: 04-1] "I do not see any urgency on this and unblocking it- the bug it will fix will be blocked anyway by many alter table , that I also have to d" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [14:10:14] !log hashar@tin Synchronized php-1.29.0-wmf.12/extensions/UniversalLanguageSelector/: Fix broken site picks feature for compact language links (duration: 01m 04s) [14:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:39] hashar: thanks ❤️ [14:11:02] ;D [14:11:14] jan_drewniak: I have pushed it to mwdebug1001 [14:11:47] looks like it is all fine [14:12:01] hashar: yup! [14:12:37] !log hashar@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 40s) [14:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:18] !log hashar@tin Synchronized portals: (no justification provided) (duration: 00m 41s) [14:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:56] hashar: If we want to run script on a wiki, that script has to be in same branch as wiki in production (script will be in production tomorrow)? [14:18:08] kart_: it depends on the script I guess ? 
:} [14:18:14] jan_drewniak: completed :} [14:18:38] (03PS2) 10Hashar: Enable TwoColConflict extension on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338738 (https://phabricator.wikimedia.org/T158493) (owner: 10Addshore) [14:18:51] (03CR) 10Hashar: [C: 032] Enable TwoColConflict extension on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338738 (https://phabricator.wikimedia.org/T158493) (owner: 10Addshore) [14:19:50] *waves* [14:20:01] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [14:20:12] (03Merged) 10jenkins-bot: Enable TwoColConflict extension on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338738 (https://phabricator.wikimedia.org/T158493) (owner: 10Addshore) [14:20:13] apparently I missed the first ping! [14:20:20] (03CR) 10jenkins-bot: Enable TwoColConflict extension on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338738 (https://phabricator.wikimedia.org/T158493) (owner: 10Addshore) [14:20:52] jouncebot: now [14:20:52] For the next 0 hour(s) and 39 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1400) [14:20:59] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3871050 keys, up 113 days 5 hours - replication_delay is 0 [14:21:33] hashar: script is at, https://gerrit.wikimedia.org/r/#/c/336073/ and to be run on svwiki [14:22:11] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3042853 (10Zppix) >>! In T158490#3039427, @Platonides wrote: > MW is the country-code of Malawi (ISO 3166-1), so I find unlikely we would be... 
[14:22:25] addshore: deploying :} [14:22:29] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3042854 (10Zppix) >>! In T158490#3039567, @Matthewrbowker wrote: >>>! In T158490#3039326, @Zppix wrote: >> @Aklapper I meant like if abbrev'd... [14:22:30] thanks! [14:22:39] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [14:22:52] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable TwoColConflict extension on arwiki - T158493 (duration: 00m 40s) [14:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:58] T158493: Deploy TwoColConflict beta to Arabic Wikipedia - https://phabricator.wikimedia.org/T158493 [14:23:27] kart_: you would have to backport the script from the master branch to the branch that svwiki is running . That is wmf.12 ( http://tools.wmflabs.org/versions/ ) [14:23:29] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 3456 MB (12% inode=96%) [14:23:42] kart_: then CR+2 , deploy it and you will be able to run the script on terbium [14:24:01] assuming the script does not depend on some code that got introduced between wmf.12 and master [14:24:06] hashar: nice. I'll do that as a part of SWAT. [14:24:13] hashar: no. It doesn't. [14:24:14] sure thing! [14:24:18] Thanks! 
[14:24:28] let me do the dance :D [14:24:48] https://gerrit.wikimedia.org/r/#/c/338971/1 and CR+2 [14:25:04] !log upgrading openssl on memcached clusters / various base service restarts [14:25:06] Oops :) [14:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:15] hashar: I also did cherry-pick :/ [14:25:29] hehe [14:25:39] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [14:26:04] ^ looking [14:26:19] hashar: Let's keep it for tomorrow? [14:26:28] chasemp: it has been flapping for hours [14:26:37] (fyi, since you're investigating) [14:26:42] thanks [14:26:44] kart_: I will deploy it and you can run it later today or tomorrow :) [14:26:53] hashar: Okay. cool. [14:27:09] hashar: also question, I don't think the portals sync-script worked :/ [14:27:20] eeek [14:27:42] jan_drewniak: maybe it failed to purge the URLs? [14:28:22] * hashar tries again [14:28:57] !log hashar@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 40s) [14:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:05] !log restarting NTP servers on dns_recursors to pick up openssl update (one by one) [14:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:29] RECOVERY - Disk space on elastic1030 is OK: DISK OK [14:29:37] !log hashar@tin Synchronized portals: (no justification provided) (duration: 00m 40s) [14:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:43] !log truncated main log file on elastic1030 [14:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:02] jan_drewniak: I redeployed and https://www.wikipedia.org/ should have been purged [14:30:14] jan_drewniak: maybe other urls need a purge as well? 
[14:32:28] (03PS1) 10Rush: labstore: 1001 and 1002 are currently idle [puppet] - 10https://gerrit.wikimedia.org/r/338973 [14:32:49] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [14:33:00] the script syncs `portals/prod/wikipedia.org/assets $*` and `portals $*` maybe it needs `portals/prod/wikipedia.org $*` ? [14:33:49] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 0.000232 secs [14:35:17] !log upgrading openssl on logstash cluster / various base service restarts [14:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:59] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:37:00] hashar: did you cherry pick script to wmf12? [14:38:49] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:39:14] kart_: waited for it to merge [14:39:17] still waiting :} [14:39:59] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [14:40:39] hashar: now :) [14:40:59] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 2e-06 secs [14:41:37] kart_: deploying [14:42:38] syncing [14:43:14] !log hashar@tin Synchronized php-1.29.0-wmf.12/extensions/UniversalLanguageSelector/maintenance/ULSCompactLinksDisablePref.php: Add a maintenance script for opt-in T133031 (duration: 00m 41s) [14:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:19] T133031: Preference conversion for Compact Language Links - https://phabricator.wikimedia.org/T133031 [14:43:58] hashar: do we have space for one more patch? [14:44:03] yu [14:44:05] yes [14:44:14] and the ULS maintenance script should now be on terbium [14:44:39] hashar: cool. 
[14:44:59] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 152299 [14:45:53] hashar: waiting for Jenkins for, https://gerrit.wikimedia.org/r/#/c/338974/1 [14:46:31] (03PS1) 10Muehlenhoff: Fix debdeploy group for kubernetes-mastes [puppet] - 10https://gerrit.wikimedia.org/r/338975 [14:47:49] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:48:19] kart_: arharhh [14:48:33] that code has an horrible look'n feel :D [14:49:00] (03CR) 10Muehlenhoff: [C: 032] Fix debdeploy group for kubernetes-mastes [puppet] - 10https://gerrit.wikimedia.org/r/338975 (owner: 10Muehlenhoff) [14:50:35] hashar: which code? [14:52:38] 06Operations, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3042911 (10hashar) [14:52:51] 06Operations, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3042924 (10hashar) [14:53:14] 06Operations, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3042911 (10hashar) [14:53:17] hashar: we also have a task to make it more sane ;) [14:55:46] (03PS1) 10Gehel: elasticsearch - reimage elastic10(33|34|38|42) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/338977 (https://phabricator.wikimedia.org/T151326) [14:56:14] Nikerabbit: kart_ I have CR+2 the wmf.12 patch https://gerrit.wikimedia.org/r/#/c/338976/ [14:56:40] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(33|34|38|42).eqiad.wmnet [14:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:19] 
hashar: okay! Let me know once on mwdebug1001 [15:00:12] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic10(33|34|38|42) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/338977 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [15:02:25] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042981 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1033.eqiad.wmnet'] ``` The... [15:02:33] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3042984 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1034.eqiad.wmnet'] ``` The... [15:02:37] 06Operations, 06TCB-Team, 10Two-Column-Edit-Conflict-Merge, 13Patch-For-Review, and 2 others: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#3042986 (10Addshore) [15:04:10] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043007 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1038.eqiad.wmnet'] ``` The... [15:04:19] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043008 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1042.eqiad.wmnet'] ``` The... [15:05:09] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:05:39] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:06:25] !log Increased manually maximum httpd keep alive requests and timeout on bohrium (piwik) - T154558 [15:06:30] !log roll-restart restbase after statsd move to graphite1001 - T157022 [15:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:30] T154558: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558 [15:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:34] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [15:06:37] !log restarting kartotherian / tilerator(ui) on maps-test* [15:06:39] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:47] phew for a moment I though I got the task number wrong [15:07:03] godog: lol [15:07:03] ahahaha [15:07:26] godog: nah it was just elukey's early april fools joke [15:07:38] Zppix: hehe early april [15:07:39] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset 0.0002 secs [15:08:22] 06Operations, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3043033 (10hashar) p:05Triage>03Low a:03hashar [15:08:26] (03PS1) 10Hashar: mediawiki-firejail: lint python scripts [puppet] - 10https://gerrit.wikimedia.org/r/338978 (https://phabricator.wikimedia.org/T158649) [15:08:28] (03PS1) 10Hashar: mediawiki-firejail: explicitly signal end of options [puppet] - 10https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) [15:08:30] (03PS1) 10Hashar: mediawiki-firejail: quiet firejail [puppet] - 
10https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649) [15:09:33] moritzm: more patches for you ^^^ :D firejail emits a message to stderr that ends up in logstash hhvm logs :D [15:09:55] !log restarting kartotherian / tilerator(ui) on maps2* [15:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:01] hashar: thanks, I'll have a look this evening [15:11:31] (03CR) 10Hashar: "I gave it a try on Jessie and they work. Not sure whether those scripts are used on Trusty on which the firejail command might not suppor" [puppet] - 10https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [15:11:48] hashar: patch on wmf12 merged. [15:11:58] (03CR) 10Hashar: "I have no idea whether we are interested in catching firejail stdout which --quiet disable as well." [puppet] - 10https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [15:12:15] kart_: yeah pushing to mwdebug1001 [15:12:40] !log restarting kartotherian / tilerator(ui) on maps1* [15:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:03] kart_: done. it is on mwdebug1001 now. [15:14:47] (03CR) 10Filippo Giunchedi: "@Jcrespo, indeed! The socket should be enough to tweak hieradata for the labs::db roles, we'll need to check socket auth though" [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo) [15:15:21] hashar: go ahead. OK this time! 
[15:15:32] (ie really tested) [15:15:49] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:17:00] !log hashar@tin Synchronized php-1.29.0-wmf.12/extensions/UniversalLanguageSelector/UniversalLanguageSelector.hooks.php: Fix site picks: missing from globals (duration: 01m 00s) [15:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:21] kart_: done :) [15:17:26] !log European SWAT complete [15:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:00] !log mobrovac@tin Started restart [mathoid/deploy@ba3217e]: Restarting for Graphite DNS switch T157022 [15:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:05] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [15:18:19] (03CR) 10Filippo Giunchedi: [C: 031] "Also FWIW this change can go ahead, prometheus will scrape prometheus-mysqld-exporter successfully. The failure to contact mysql itself is" [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo) [15:18:25] hashar: thanks! 
[15:18:39] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:19:23] !log mobrovac@tin Started restart [citoid/deploy@95df861]: Restarting for Graphite DNS switch T157022 [15:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:09] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disables drafts in gerrit - https://phabricator.wikimedia.org/T158656#3043080 (10Paladox) [15:20:22] !log rolling restart of swift frontend servers to pick up openssl update [15:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:32] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1034.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1034.eqi... [15:21:17] !log mobrovac@tin Started restart [cxserver/deploy@0e4ae4f]: Restarting for Graphite DNS switch T157022 [15:21:24] kart_: fyi ^ [15:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:30] !log restart eventlogging on kafka200[123] for openssl upgrades [15:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:41] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disables drafts in gerrit - https://phabricator.wikimedia.org/T158656#3043111 (10Paladox) Here is the changes https://gerrit-review.googlesource.com/#/q/topic:private-changes+(status:open+OR+status:merged) that will bring support for priva... 
[15:24:00] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3043112 (10Paladox) [15:24:19] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:24:26] mobrovac: thanks. Any action from us? [15:24:39] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:01] kart_: nope, just switching stats backends [15:26:19] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset -0.001388 secs [15:26:39] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [15:27:35] mobrovac: kafka2001 (sorry I was already working on it) has been depooled, restarted, waited a bit, repooled and checked with httpry. Everything seems good [15:27:46] I am going to do the rest on kafka100[123] first [15:28:19] err too many things, it was kafka1001 indeed [15:28:38] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043122 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1038.eqiad.wmnet'] ``` and were **ALL** successful. 
[15:29:08] lol elukey [15:29:18] elukey: kk, both eventbus and cp look good [15:29:41] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [15:30:20] !log mobrovac@tin Started restart [graphoid/deploy@da37386]: Restarting for Graphite DNS switch T157022 [15:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:24] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [15:31:21] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:32:08] !log correction on my previous entry: restart eventlogging on kafka100[123] for openssl upgrades [15:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:28] this is why I got confused, I've read the log and thought "snap I did the wrong thing!" [15:32:32] need coffee [15:32:42] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043129 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1042.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1042.eqi... [15:33:01] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:33:07] kafka1002 done [15:33:10] all good.. 
[15:33:14] going to finish with 1003 [15:33:21] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset 0.000525 secs [15:34:41] !log mobrovac@tin Started restart [mobileapps/deploy@cd3b897]: Restarting for Graphite DNS switch T157022 [15:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:06] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043172 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1033.eqiad.wmnet'] ``` and were **ALL** successful. [15:36:19] mobrovac: eqiad done :) [15:36:39] elukey: double \o/ as all is looking good [15:37:00] hm, weird how happy we are when technology works as it's supposed to [15:37:04] makes one wonder ... [15:37:24] mobrovac: proceeding with codfw ok? [15:37:30] 06Operations, 06Operations-Software-Development: Keyholder accept passwordless keys - https://phabricator.wikimedia.org/T158660#3043173 (10Volans) p:05Triage>03High a:03Volans [15:37:50] elukey: kk [15:38:08] (03PS1) 10Volans: Keyholder: fix filter of passwordless keys [puppet] - 10https://gerrit.wikimedia.org/r/338984 [15:38:41] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:39:06] (03PS2) 10Jcrespo: prometheus-mysql-exporter: Add labsdb1005, just upgraded from precise [puppet] - 10https://gerrit.wikimedia.org/r/338970 [15:39:23] !log restart jmxtrans on kafka[12]00[123] for T157022 [15:39:24] (03CR) 10Jcrespo: [C: 032] prometheus-mysql-exporter: Add labsdb1005, just upgraded from precise [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo) [15:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:28] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [15:40:05] !log restart navtiming ve asset-check statsd-mw-js-deprecate on hafnium to pick up statsd.eqiad.wmnet 
change - T157022 [15:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:12] !log restart eventlogging on kafka200[123] for openssl upgrades [15:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:41] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 0.000217 secs [15:42:18] (03PS1) 10Giuseppe Lavagetto: Only output "changed" values if actually changed [software/conftool] - 10https://gerrit.wikimedia.org/r/338985 [15:42:20] (03PS1) 10Giuseppe Lavagetto: Add explicit dependencies [WiP] [software/conftool] - 10https://gerrit.wikimedia.org/r/338986 [15:42:35] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043191 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1042.eqiad.wmnet'] ``` The... [15:42:40] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043192 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1034.eqiad.wmnet'] ``` The... [15:43:06] (03CR) 10Jcrespo: [V: 032 C: 032] prometheus-mysql-exporter: Add labsdb1005, just upgraded from precise [puppet] - 10https://gerrit.wikimedia.org/r/338970 (owner: 10Jcrespo) [15:43:34] mobrovac: I don't really see a lot of traffic on kafka200[123] for EL [15:43:57] elukey: EL? 
[15:44:14] eventlogging or the http service, as you want to call it :D [15:44:41] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [15:47:01] PROBLEM - DPKG on rhenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:47:41] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [15:48:01] RECOVERY - DPKG on rhenium is OK: All packages OK [15:48:41] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [15:48:54] * gehel having a look at wdqs1002... [15:49:15] (03CR) 10jerkins-bot: [V: 04-1] Add explicit dependencies [WiP] [software/conftool] - 10https://gerrit.wikimedia.org/r/338986 (owner: 10Giuseppe Lavagetto) [15:50:41] mobrovac: scb looking good, thanks! would you have time for ores and parsoid as well? if not I'll roll-restart in ~1h after a meeting [15:50:41] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [15:50:52] !log restarting wdqs-updater on wdqs1002 [15:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:35] godog: i think you better ping Amir1 for ores, I can do parsoid 2h from now [15:51:58] hey, what I need to do for Ores [15:52:07] mobrovac: ack thanks! I'll ping you if I don't get to do parsoid [15:52:24] Amir1: hey, we'd need a simple rolling-restart for ores to pick up DNS changes for statsd.eqiad.wmnet [15:53:31] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3570.30 Read Requests/Sec=3083.90 Write Requests/Sec=35.20 KBytes Read/Sec=28135.20 KBytes_Written/Sec=7730.00 [15:53:41] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:54:59] godog: Okay, I do it now, if it's okay. 
[15:58:08] Amir1: sure, thanks! [16:03:45] 06Operations, 07LDAP, 13Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#3043262 (10faidon) @MoritzMuehlenhoff, any news from openldap-technical or in general about this? [16:05:51] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 2183 MB (8% inode=96%) [16:06:51] !log truncated main log file on elastic1030 [16:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:31] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=114.40 Read Requests/Sec=193.80 Write Requests/Sec=3.20 KBytes Read/Sec=2030.00 KBytes_Written/Sec=373.60 [16:08:09] !log restarting apache on uranium for openssl update [16:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:14] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043269 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1034.eqiad.wmnet'] ``` and were **ALL** successful. [16:08:54] 06Operations, 07LDAP, 13Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#3043271 (10MoritzMuehlenhoff) Not yet, no. 
[16:11:21] 06Operations, 10DBA, 13Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#3043273 (10jcrespo) [16:11:26] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#3043272 (10jcrespo) [16:11:28] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#3043274 (10jcrespo) [16:15:31] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%) [16:15:51] RECOVERY - Disk space on elastic1030 is OK: DISK OK [16:17:51] 06Operations, 10Monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3043279 (10faidon) @Dzahn, what's the status of this? [16:18:16] !log truncated main elastic log, daemon.log and syslog on elastic1023 [16:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:23] (03PS1) 10Jcrespo: Remove old CA (ssl='on') and add a new option "socket" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338988 (https://phabricator.wikimedia.org/T157702) [16:18:31] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:19:31] RECOVERY - Disk space on elastic1023 is OK: DISK OK [16:21:11] PROBLEM - Keyholder SSH agent on sarin is CRITICAL: CRITICAL: Cannot connect to keyholder-proxy socket /run/keyholder/proxy.sock. 
[16:23:07] (03PS2) 10Jcrespo: Remove old CA (ssl='on') and add a new option "socket" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338988 (https://phabricator.wikimedia.org/T157702) [16:26:01] PROBLEM - DPKG on dataset1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:26:15] ^expected, update in progress [16:26:23] (03PS2) 10Giuseppe Lavagetto: prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852 [16:27:01] RECOVERY - DPKG on dataset1001 is OK: All packages OK [16:27:01] (03CR) 10Giuseppe Lavagetto: prometheus: add etcd metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/336852 (owner: 10Giuseppe Lavagetto) [16:27:08] <_joe_> godog: care to re-review? ^^ [16:27:19] <_joe_> I'd like to take it to production [16:27:30] <_joe_> tomorrow is fine ofc [16:27:31] <_joe_> :) [16:27:51] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 518 MB (1% inode=96%) [16:28:46] will it be ok with so little space? [16:29:21] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: elasticsearch logs are duplicated in journald - https://phabricator.wikimedia.org/T158664#3043290 (10Gehel) [16:29:31] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: elasticsearch logs are duplicated in journald - https://phabricator.wikimedia.org/T158664#3043303 (10Gehel) p:05Triage>03High [16:29:46] dcausse already truncated the log earlier in the day [16:29:51] oh [16:29:53] it is / [16:29:58] I misread it as /srv [16:30:06] not worried, then [16:30:21] * gehel is slightly worried, but not too much :) [16:30:47] well, if logs are lost it's bad, if a service goes down it's worse :-) [16:31:34] !log truncating elasticsearch logs on elastic1030 [16:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:51] RECOVERY - Disk space on elastic1030 is OK: DISK OK [16:32:12] (03PS5) 10Ottomata: Symlink reportupdater output to published-datasets [puppet] - 
10https://gerrit.wikimedia.org/r/337672 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric) [16:32:31] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%) [16:33:21] (03PS6) 10Ottomata: Symlink reportupdater output to published-datasets [puppet] - 10https://gerrit.wikimedia.org/r/337672 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric) [16:34:20] !log truncating elasticsearch logs on elastic1023 [16:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:31] RECOVERY - Disk space on elastic1023 is OK: DISK OK [16:34:51] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 1478 MB (5% inode=96%) [16:36:13] (03PS3) 10Jcrespo: Remove old CA (ssl='on') and add a new option "socket" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338988 (https://phabricator.wikimedia.org/T157702) [16:36:17] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852 (owner: 10Giuseppe Lavagetto) [16:37:17] !log restarting elasticsearch on elastic1030 [16:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:31] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: / 1456 MB (5% inode=96%) [16:38:41] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [16:39:51] RECOVERY - Disk space on elastic1030 is OK: DISK OK [16:40:31] RECOVERY - Disk space on elastic1023 is OK: DISK OK [16:41:24] ok, we should be good again on those logs... it is more than time to upgrade to elastic 5! [16:42:27] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043355 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1042.eqiad.wmnet'] ``` The... 
[16:42:47] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043356 (10elukey) Checked the oxygen logs and the following UA is the only one getting 503s during the past 21 days: ```244268 "Wikipedia/10... [16:47:42] stashbot: help [16:47:43] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [16:48:31] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:50:56] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (10Ottomata) Just FYI, there is a Kafka based Monolog implementation in Mediawiki, currently used by the Discovery team for shipping some logs to Hadoop. I betcha we could pretty easily use... [16:52:12] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043475 (10Milimetric) Ping @Fjalapeno this UA is the iOS app, right? Any help you can provide Luca in finding out why we might be seeing 503... [16:52:26] (03CR) 10Ottomata: [C: 032] Symlink reportupdater output to published-datasets [puppet] - 10https://gerrit.wikimedia.org/r/337672 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric) [16:55:24] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043511 (10elukey) Just adding a note: I am seeing also others similar UA, that follows the same pattern.. but nothing else. I suspect that I... [16:55:56] jouncebot: refresh [16:55:57] I refreshed my knowledge about deployments. 
[16:56:01] jouncebot: next [16:56:01] In 0 hour(s) and 3 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1700) [16:56:16] Puppet SWAT is empty. We moved my patches to tomorrow morning :) [16:59:29] !log cache_misc, cache_maps: libssl1.1 upgraded to 1.1.0e-1+wmf1, libevent-2.0-5 upgraded to 2.0.21-stable-2+deb8u1 [16:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1700). Please do the needful. [17:01:49] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: upgrade backup4001 hard disk array - https://phabricator.wikimedia.org/T157473#3043535 (10Jgreen) a:05Jgreen>03RobH Reassigning to Rob because we're stuck at a hardware problem (new HDDs appear to be incompatible with the controller/BIOS/firmware?) [17:04:10] oohh puppet swat empty? https://i.redd.it/6osjlug3xugy.gif [17:04:15] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: upgrade backup4001 hard disk array - https://phabricator.wikimedia.org/T157473#3043539 (10RobH) So now the system is in a bad state where I cannot login to the webGUI to upgrade firmware, and its not coming back from it by racreset. in detail: trying to log... [17:05:01] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 19 [17:05:56] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3043555 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1042.eqiad.wmnet'] ``` and were **ALL** successful. [17:06:03] 06Operations: Expire time on 404 is too high (Wikipedia) - https://phabricator.wikimedia.org/T157214#3043556 (10Aklapper) 05Open>03declined Unfortunately closing this report as no further information has been provided. 
@Mjbmr: Please reopen this report (by changing its status) after you have provided the inf... [17:07:41] (03PS5) 10Madhuvishy: diamond: Allow providing puppet file reference to collector config file [puppet] - 10https://gerrit.wikimedia.org/r/337769 [17:08:41] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:58] (03CR) 10Madhuvishy: [C: 032] diamond: Allow providing puppet file reference to collector config file [puppet] - 10https://gerrit.wikimedia.org/r/337769 (owner: 10Madhuvishy) [17:10:07] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043573 (10Fjalapeno) [17:10:21] (03PS1) 10Jcrespo: Upcoming mediawiki-core hardware expansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580) [17:10:42] (03CR) 10Madhuvishy: [V: 032 C: 032] diamond: Allow providing puppet file reference to collector config file [puppet] - 10https://gerrit.wikimedia.org/r/337769 (owner: 10Madhuvishy) [17:12:07] (03CR) 10Jcrespo: [C: 04-2] "This is not intended for commit (but please do not abandon for 1-2 years)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580) (owner: 10Jcrespo) [17:13:14] (03CR) 10jerkins-bot: [V: 04-1] Upcoming mediawiki-core hardware expansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580) (owner: 10Jcrespo) [17:13:53] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043580 (10Fjalapeno) @Milimetric having @joewalsh verify this for you [17:13:54] (03PS3) 10Madhuvishy: labstore: Read directory size diamond collector config from external file [puppet] - 10https://gerrit.wikimedia.org/r/337785 [17:14:29] (03CR) 10Madhuvishy: [V: 032 C: 032] labstore: Read directory size diamond collector config from external file [puppet] - 10https://gerrit.wikimedia.org/r/337785 (owner: 10Madhuvishy) [17:17:56] Amir1: did the ores rolling restart happen? still seeing statsd metrics towards graphite2001 [17:18:30] godog: sorry for late work, I was looking for my yubikey [17:19:30] _joe_: no worries, I'm around for another hour at least [17:19:34] no, that was for Amir1 [17:19:52] _joe_: re https://gerrit.wikimedia.org/r/#/c/336852 I had a comment about using the default ssl vhost too [17:20:03] <_joe_> sorry, meeting [17:20:11] !log restarting ores uwsgi and celery services in scb nodes [17:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:33] 1 down, 7 to go (doing codfw ones too) [17:20:38] Amir1: neat, thanks! 
[17:27:00] godog: eqiad nodes are done now [17:27:06] going to codfw nodes [17:34:36] (03PS1) 10Gehel: WIP - elasticsearch: only send minimal logging to console [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) [17:35:01] !log done restarting ores services [17:35:04] godog: ^ [17:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:14] Amir1: fantastic, thanks for your help! [17:35:32] Thank you! [17:35:55] !log roll-restart parsoid in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022 [17:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:01] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [17:36:41] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:43:41] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:19] 06Operations, 10Domains, 10Traffic: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (10Reedy) https://github.com/wikimedia/operations-dns/blob/master/templates/wikimedia.ee If you follow "Add a record to your domain settings (Recommended)", and provide t... 
[17:46:56] (03PS1) 10Chad: Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 [17:47:12] !log roll-restart jmxtrans in codfw/eqiad on conf* to pick up statsd.eqiad.wmnet DNS changes - T157022 [17:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:18] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [17:48:16] (03CR) 10jerkins-bot: [V: 04-1] Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 (owner: 10Chad) [17:49:09] (03PS2) 10Chad: Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 [17:49:15] stupid unit test [17:50:36] !log roll-restart ocg in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022 [17:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:54] (03CR) 10Chad: [C: 032] Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 (owner: 10Chad) [17:56:09] (03Merged) 10jenkins-bot: Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 (owner: 10Chad) [17:57:17] (03CR) 10jenkins-bot: Multiversion: Don't trigger a PHP warning on non-500 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338999 (owner: 10Chad) [17:57:31] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3038649 (10CRoslof) One- and two-character .org domain names aren't available for general registration. See, for example, this press release... 
[17:57:44] !log demon@tin Synchronized multiversion/MWMultiVersion.php: Shut up dumb invalid hostname errors (duration: 00m 52s) [17:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:19] (03CR) 10Dzahn: "why the change? you don't want it to redirect to the tools directly anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt) [17:58:37] !log demon@tin Synchronized tests/multiversion/MWMultiVersionTest.php: No op in prod, completeness, etc (duration: 00m 40s) [17:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T1800). [18:00:06] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3038649 (10Dzahn) Even if we would be able to get it and wanted to use it, it would still be blocked on T133548. [18:03:12] nothing for ores today [18:03:50] !log roll-restart trendingedits in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022 [18:03:52] (03PS1) 10Madhuvishy: labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 [18:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:55] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [18:04:15] 06Operations, 10Domains, 10Traffic: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3043748 (10Reedy) Oh, and if you're wanting to use Google Apps like that.. I suspect your mail server MX records will need updating - https://github.com/wikimedia/operations-dns/b... 
[18:04:21] 06Operations, 10Monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3043749 (10Dzahn) check_ipmi_sensor has been installed across the fleet but doesn't work. running it with options for temperature makes it exit with "CRIT" for _non_-temperature things root@lead:~# /usr/lo... [18:04:41] !log roll-restart eventstreams in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022 [18:04:42] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3038649 (10demon) Heh, I had this idea like 5 **years** ago but never felt like bothering to follow-up on it. Plus T133548 [18:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:20] (03CR) 10jerkins-bot: [V: 04-1] labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 (owner: 10Madhuvishy) [18:05:22] (03PS2) 10Madhuvishy: labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 [18:06:28] hashar: bouncing zuul on contint1001 isn't impactful is it? [18:06:41] (03CR) 10jerkins-bot: [V: 04-1] labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 (owner: 10Madhuvishy) [18:06:51] (03PS3) 10Madhuvishy: labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 [18:12:03] 06Operations, 10Traffic, 10Wikimedia-Mailing-lists: convert lists.wikimedia.org certificate to LetsEncrypt (deadline:2017-03-02) - https://phabricator.wikimedia.org/T154917#3043827 (10RobH) p:05Triage>03High a:05RobH>03BBlack I'm just not getting through this fast enough, so I'm reassigning this to B... 
[18:12:10] !log roll-restart zuul on cont1001 to pick up statsd.eqiad.wmnet DNS changes - T157022 [18:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:15] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [18:12:41] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:12:57] !log roll-restart nodepool on labnodepool1001 to pick up statsd.eqiad.wmnet DNS changes - T157022 [18:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:24] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(27|32|34|38|41).eqiad.wmnet [18:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:16] 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3043838 (10Aklapper) @Jseddon / @MBeat33: Any news? [18:18:45] (03PS2) 10Volans: Keyholder: fix filter of passwordless keys [puppet] - 10https://gerrit.wikimedia.org/r/338984 (https://phabricator.wikimedia.org/T158660) [18:20:08] 06Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3043870 (10RobH) {F5743871} is the zip of the license info. @Papaul: Next time you need me to pull this, please assign it to me so I won't miss it. Please update the license on the system. While this... [18:23:08] (03PS1) 10Volans: Keyholder: add support for ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/339002 (https://phabricator.wikimedia.org/T158659) [18:23:11] 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#3043881 (10RobH) The license should not expire, that is strange. I've downloaded it from Dell's license management site: {F5743908} - iDRAC7 Enterprise,Perpetual,Digital License only Please up... 
[18:23:23] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043886 (10Fjalapeno) @Milimetric @elukey verified that this is the iOS app [18:23:39] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043892 (10Jgreen) [18:24:17] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043908 (10Jgreen) @EWilfong_WMF are you the right point of contact for Trilogy for this? [18:24:40] 06Operations, 10Graphite: Improve graphite failover - https://phabricator.wikimedia.org/T88997#3043910 (10fgiunchedi) [18:26:02] (03PS1) 10Jcrespo: [WIP]mariadb: Include a new option "socket" for all servers [puppet] - 10https://gerrit.wikimedia.org/r/339004 [18:26:29] (03CR) 10Jcrespo: [C: 04-2] "Not ready for deploy." [puppet] - 10https://gerrit.wikimedia.org/r/339004 (owner: 10Jcrespo) [18:26:59] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043925 (10RobH) Please note that some potential details for this are also on private task T156849. However, relevant info has been... [18:28:12] 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Keyholder accept passwordless keys - https://phabricator.wikimedia.org/T158660#3043933 (10Volans) @mmodell I'm not sure what's the status with the https://phabricator.wikimedia.org/source/keyholder/ repository that was recently created. I'... [18:30:26] 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Keyholder accept passwordless keys - https://phabricator.wikimedia.org/T158660#3043941 (10mmodell) @volans: Thanks for the heads-up. 
We still use the code from puppet in prod. It will remain that way until I get the package accepted by... [18:34:21] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:34:27] (03CR) 10Rush: [C: 031] labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 (owner: 10Madhuvishy) [18:34:43] (03CR) 10Madhuvishy: [C: 032] labstore: Fix sudo permissions for directory size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339001 (owner: 10Madhuvishy) [18:34:59] !log changeprop deploy 4706f9da [18:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:49] !log ppchelko@tin Started deploy [changeprop/deploy@4706f9d]: Change-Prop: Make ORES return minified responses T157693 [18:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:54] T157693: Use minified JSON format in ChangeProp - https://phabricator.wikimedia.org/T157693 [18:36:44] !log ppchelko@tin Finished deploy [changeprop/deploy@4706f9d]: Change-Prop: Make ORES return minified responses T157693 (duration: 00m 55s) [18:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:24] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043971 (10JoeWalsh) @Milimetric this UA is from the iOS app. In testing locally, I didn't see any 503s. A potential cause of the surge...
[18:45:11] (03PS1) 10Chad: group0 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005 [18:45:21] (03CR) 10jerkins-bot: [V: 04-1] [WIP]mariadb: Include a new option "socket" for all servers [puppet] - 10https://gerrit.wikimedia.org/r/339004 (owner: 10Jcrespo) [18:45:23] (03CR) 10Chad: [C: 04-2] "For l8r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005 (owner: 10Chad) [18:45:58] !log installing PHP security updates on iridium (phabricator.wikimedia.org) [18:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:13] (03CR) 10Tim Landscheidt: "@Dzahn: The change should be a no-op for the redirects (at least that is my intention). I just want to use the same syntax for all redire" [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt) [18:49:31] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2866.70 Read Requests/Sec=5072.80 Write Requests/Sec=7.80 KBytes Read/Sec=23726.80 KBytes_Written/Sec=234.00 [18:49:53] !log demon@tin Started scap: prime wmf.13 - testwiki plus l10n build [18:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:31] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4671.60 Read Requests/Sec=2119.00 Write Requests/Sec=32.80 KBytes Read/Sec=18994.40 KBytes_Written/Sec=5586.80 [18:56:11] RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys. [18:58:11] (03CR) 10Muehlenhoff: [C: 04-1] "Not sure if that's really desirable. 
I agree the specific log message is superfluous, but firejail doesn't have a concept of log verbosity" [puppet] - 10https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [19:01:16] Dangit, scap didn't pick up my testwiki to wmf.13 [19:01:25] So didn't build l10n :( [19:02:21] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:02:31] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=247.50 Read Requests/Sec=306.70 Write Requests/Sec=74.30 KBytes Read/Sec=8127.60 KBytes_Written/Sec=751.60 [19:03:54] - "testwiki": "php-1.29.0-wmf.12", [19:03:55] + "testwiki": "php-1.29.0-wmf.13", [19:03:56] Why not? [19:03:56] stupid scap [19:04:22] (03PS1) 10Madhuvishy: labstore: Change prefix depth and byteunit config for dir size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339006 [19:04:28] RainbowSprinkles: ah crap :( This is https://phabricator.wikimedia.org/T156851 [19:04:45] Ahhh [19:04:50] I forgot that bug [19:04:52] Hmm [19:04:56] (03PS1) 10Rush: wip: tools: allow generic banner for inf protection [puppet] - 10https://gerrit.wikimedia.org/r/339007 [19:05:10] I have a fix. The workaround for now is to run a scap pull on tin before the sync [19:05:12] sorry :( [19:05:30] Ah, I'll let this sync finish first so the files go out everywhere [19:05:32] Then do that [19:05:56] (03CR) 10jerkins-bot: [V: 04-1] wip: tools: allow generic banner for inf protection [puppet] - 10https://gerrit.wikimedia.org/r/339007 (owner: 10Rush) [19:06:02] if you sync again it'll just work [19:06:22] just make sure that /srv/mediawiki/wikiversions.json is correct, but I think it's just an order of operations thing. [19:06:50] * RainbowSprinkles nods [19:06:51] Thx [19:06:54] totally my fault in the 3.5.x release :( [19:07:09] Feels like dejavu ;P [19:07:48] heh, there's some fun scap history on that task. 
I'm retravelling well worn paths it seems. [19:07:55] (03CR) 10Rush: openstack: nova_fullstack_test changes to daemonize (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337598 (owner: 10Rush) [19:08:10] (03PS11) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [19:09:48] (03PS12) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [19:10:14] (03CR) 10Madhuvishy: [C: 032] labstore: Change prefix depth and byteunit config for dir size diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/339006 (owner: 10Madhuvishy) [19:14:21] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:14:54] moritzm hi, upstream are doing the systemd file it seems https://gerrit-review.googlesource.com/#/c/89893/ :) [19:15:11] (gerrit) [19:15:19] nice! [19:15:51] yep, will be available in gerrit 2.13.6 according to https://groups.google.com/forum/#!topic/repo-discuss/SL_lXZDDG_g [19:16:08] !log demon@tin Finished scap: prime wmf.13 - testwiki plus l10n build (duration: 26m 15s) [19:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:12] !log demon@tin Started scap: prime wmf.13 - testwiki plus l10n build (pt 2 because T156851) [19:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:17] T156851: scap wikiversions compile happening too late in scap sync - https://phabricator.wikimedia.org/T156851 [19:18:11] 06Operations, 10ops-eqiad, 10hardware-requests: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3044110 (10RobH) a:03Cmjohnson [19:18:21] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [19:18:34] thcipriani: Indeed, working now [19:20:01] thcipriani: Heh, completing the sync then doing a second? 
Results in a bajillion File not found: /srv/mediawiki/php-1.29.0-wmf.13/../wmf-config/ExtensionMessages-1.29.0-wmf.13.php in /srv/mediawiki/wmf-config/CommonSettings.php on line 3416 [19:20:33] moritzm seems the change also auto starts gerrit whats a request on 80 is recived. I think thats if someone goes to gerrit.wikimedia.org and it's down. it will auto start though i may have misread it or something. [19:20:39] mutante ^^ :) [19:20:52] "and a corresponding gerrit.service file enables an automatic start of gerrit [19:20:52] on the first request on port 80." [19:21:05] Well that seems like bizarre behavior [19:21:12] Why would you want it to be off until port 80 is hit? [19:21:13] :) [19:21:41] Or, if you want to take it down for maintenance, have it come up automatically because someone tries hitting it :) [19:21:53] (Also, their systemd file looks mostly useless for us, we don't serve over :80 [19:22:13] We serve over 8080 which is proxied to 443 for users [19:23:36] RainbowSprinkles not sure. [19:23:59] we will want to customise the file since well they are putting it on this path /opt/gerritsrv/ [19:24:07] which isen't on /var/lib/gerrit2. [19:24:19] We can just write our own :) [19:24:25] Already done that [19:24:29] There's no point in using theirs, it's a dumb stub :) [19:24:37] I know, we should just keep our own [19:24:40] RainbowSprinkles https://gerrit.wikimedia.org/r/#/c/333475/ [19:24:41] ok [19:25:11] that one works better in my testing then the init.d one. [19:25:44] It's still ultimately the same script :) [19:25:53] Our init.d was just copied from ./bin/gerrit.sh [19:25:54] :) [19:26:10] Yep, but just never worked when you ran the script as root [19:26:19] i mean sudo service gerrit start or stop [19:26:23] It worked in prod ;-) [19:26:29] Otherwise I would've been worried [19:26:34] oh [19:27:10] how did you manage to get it working? Is the pid run as root? The pid for me keeps getting set as root. but with systemd it is gerrit2. 
[19:27:35] It's gerrit2 [19:28:07] yep [19:32:13] !log demon@tin scap failed: RuntimeError 2 test canaries had check failures (rerun with --force to override this check) (duration: 15m 00s) [19:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:44] !log demon@tin Started scap: prime wmf.13 - testwiki plus l10n build (pt 3 because ugh) [19:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:58] RainbowSprinkles and moritizm i will just try to upstream the version i created of systemd for gerrit. [19:33:06] thcipriani: Heh, upside is that canary checks did their job. Downside is I'm trying to *fix* that breakage ;-) [19:33:14] They can have two version of the file as you can write it in different ways in systemd :) [19:33:22] RainbowSprinkles: :( [19:36:48] wtf...? [19:36:59] /wiki/Special:Version MWException from line 481 of /srv/mediawiki/php-1.29.0-wmf.13/includes/cache/localisation/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [19:37:57] ugh, what is happening? Is scap not rebuilding l10n? [19:38:18] Claimed it did [19:39:20] Lets let the current scap finish so the can't find extensionmessages bit goes away [19:39:24] Then we'll figure out why no en language [19:39:56] didn't rebuild the cdbs... [19:40:08] (03CR) 10Volans: [C: 04-1] "one leftover from debugging, see inline" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [19:40:17] well, at least so far, just have the upstream/*.{json,md5} files [19:40:50] Ah, I figured it out [19:40:53] should do that as the last step of a scap, I guess... [19:41:00] scap-rebuild-cdbs didn't do anything the first time [19:41:01] PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:41:04] Because of the bug [19:41:28] ah, right, since it is not an active wikiversion the first time around. [19:41:30] But we've swapped to wmf.13, so testwiki is expecting the cache files [19:41:41] It *should* clean up as this scap finishes [19:42:16] thcipriani: This bug is nasty :( [19:42:24] hrm...update wikiversions should double-check the cdb files as well. I know we do that in several places. [19:43:43] indeed. it's ugly. and it's a tangly mess. [19:46:35] RainbowSprinkles did you notice the extra emails gerrit sends out now. [19:46:40] No [19:46:54] thcipriani: scap-rebuild-cdbs is fixing it this time [19:46:58] Error rate going down [19:47:17] Well i have a ton even from changes that have not changed in a while. Makes it hard to look at newer changes needing reviews. [19:47:32] But upstream have annoleged the bug and have a fix for the rest api [19:47:41] but not for the ui, gwt and polygerrit [19:48:02] I disable most e-mails anyway :) [19:48:06] oh [19:48:15] the bug only happened in 2.13. [19:48:36] 2.13 is a terrible release! [19:48:57] Yeh, 2.14 will be a great release :) [19:49:10] Hopefully less buggy [19:49:11] We'll wait until at least 2.14.1 or .2 [19:49:14] I don't trust .0 [19:49:15] :D [19:49:18] yep [19:49:32] RainbowSprinkles: You will love it. So awesome. [19:49:42] Maybe gerrit will be great again [19:49:47] lol [19:49:48] Like 2.8.forever [19:49:53] 2.8.x was a great release [19:49:55] The last good release [19:49:56] :p [19:50:02] !log demon@tin Finished scap: prime wmf.13 - testwiki plus l10n build (pt 3 because ugh) (duration: 17m 17s) [19:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:14] RainbowSprinkles what did you like about 2.8? [19:50:19] It was stable [19:50:24] So. Freaking. Stable. [19:50:28] oh ah. [19:50:35] Bugs? Sure. 
But they were UI quirks and minor things [19:51:11] RainbowSprinkles i think everything will be unstable as upstream are going with NoteDb and doint allow access to rest api [19:51:34] so no one really expererence the bugs there so that may be why everything is buggy. [19:52:06] thcipriani: All better now. [19:52:08] RainbowSprinkles my patch for allowing owners to delete there own changes was merged :) [19:52:09] That was...annoying [19:52:45] agreed. will have a fix whenever I can get testing to work. [19:52:57] The component there building polygerrit on is available for gwt. Polymer for gwt. [19:53:40] Reedy the 2.13.6 release will help us prevent problems like https://phabricator.wikimedia.org/T153079 [19:53:51] includes a couple of fixes for submodules :) [20:00:05] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T2000). Please do the needful. [20:00:40] (03PS1) 10Gergő Tisza: Fix Sentry URL scheme on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339012 [20:01:51] jouncebot: get with the program, I already am [20:03:53] (03PS1) 10Madhuvishy: labstore: Log directory size collector size in bytes [puppet] - 10https://gerrit.wikimedia.org/r/339013 [20:04:16] (03CR) 10Chad: [C: 032] group0 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005 (owner: 10Chad) [20:07:23] (03Merged) 10jenkins-bot: group0 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005 (owner: 10Chad) [20:07:31] (03CR) 10jenkins-bot: group0 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339005 (owner: 10Chad) [20:08:18] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.13 [20:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:28] cmjohnson1: is there any ETA on https://phabricator.wikimedia.org/T157425? 
[20:08:39] cmjohnson1: anything would useful to plan by [20:08:53] cmjohnson1: including "when hell freezes over" [20:08:54] :) [20:09:01] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:09:03] (03CR) 10Madhuvishy: [V: 032 C: 032] labstore: Log directory size collector size in bytes [puppet] - 10https://gerrit.wikimedia.org/r/339013 (owner: 10Madhuvishy) [20:09:08] (03PS1) 10BBlack: LE: allow non-root key ownership/perms [puppet] - 10https://gerrit.wikimedia.org/r/339015 (https://phabricator.wikimedia.org/T154917) [20:09:10] (03PS1) 10BBlack: lists: use LE cert for exim [puppet] - 10https://gerrit.wikimedia.org/r/339016 (https://phabricator.wikimedia.org/T154917) [20:09:14] urandom: Between now and the heat death of the universe? ;-) [20:09:49] RainbowSprinkles: sure, sure [20:10:01] RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:10:07] though, you know, anything to narrow that down... [20:11:12] thcipriani: group0 appears quiet, minus that spike due to cache fun times. [20:11:18] So, success? [20:11:45] (just the usual culprit of redis) [20:12:49] well. A success in terms of the new version anyway. [20:12:59] (new version of mediawiki) [20:14:24] (03PS1) 10Gehel: elasticsearch - reimage elastic10(35|39|43|44) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/339017 (https://phabricator.wikimedia.org/T151326) [20:15:49] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(35|39|43|44).eqiad.wmnet [20:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:19] (03CR) 10Andrew Bogott: [C: 031] "Looking forward to seeing this in action." 
[puppet] - 10https://gerrit.wikimedia.org/r/337598 (owner: 10Rush) [20:17:21] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044385 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1035.eqiad.wmnet'] ``` The... [20:17:22] (03CR) 10Paladox: [C: 031] "We can now do this as the fix for ipv6 was rolled out on our install." [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [20:17:30] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044386 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1039.eqiad.wmnet'] ``` The... [20:17:37] (03PS2) 10Paladox: phabricator: enable vcs and web user to run `git` and `ssh` via sudo [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [20:17:45] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044387 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1044.eqiad.wmnet'] ``` The... [20:17:57] (03CR) 10Paladox: [C: 031] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/338302 (owner: 1020after4) [20:18:01] (03PS1) 10Thcipriani: scap prep: fix subprocess calls for master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339018 [20:18:03] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044388 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1043.eqiad.wmnet'] ``` The... 
[20:19:11] (03CR) 10Paladox: [C: 031] "@Dzahn this is needed, otherwise the file doesn't not get correctly created. IE some erb syntax is left in the file. Tested this fix local" [puppet] - 10https://gerrit.wikimedia.org/r/338302 (owner: 1020after4) [20:20:41] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:22:41] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:48] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3044424 (10EWilfong_WMF) @Jgreen Yes, I will be the point of contact for this update. This domain is hosted using Azure's App Servic... [20:24:21] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:25:09] (03PS1) 10Dzahn: install: allow rsync of /home from carbon to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/339019 (https://phabricator.wikimedia.org/T158020) [20:26:37] (03PS2) 10Dzahn: install: allow rsync of /home from carbon to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/339019 (https://phabricator.wikimedia.org/T158020) [20:28:21] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [20:31:08] (03CR) 10Dzahn: [C: 032] install: allow rsync of /home from carbon to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/339019 (https://phabricator.wikimedia.org/T158020) (owner: 10Dzahn) [20:31:32] (03PS1) 10Dzahn: install: remove carbon from puppet [puppet] - 10https://gerrit.wikimedia.org/r/339021 (https://phabricator.wikimedia.org/T158020) [20:36:33] !log rsyncing /home/ dirs excl. 
dot files, from carbon to install1002 (T158020) [20:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:39] T158020: decom carbon - https://phabricator.wikimedia.org/T158020 [20:37:01] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:37:41] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:39:16] (03PS1) 10Dzahn: install: set install1002 as primary install again [puppet] - 10https://gerrit.wikimedia.org/r/339023 (https://phabricator.wikimedia.org/T158020) [20:39:19] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044511 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1043.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1043.eqi... [20:42:15] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044520 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1039.eqiad.wmnet'] ``` and were **ALL** successful. 
[20:44:27] !log carbon - backup /root data to install1002:/root/root-carbon/ before shutdown (T158020) [20:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:31] T158020: decom carbon - https://phabricator.wikimedia.org/T158020 [20:45:19] (03CR) 10Chad: [C: 032] scap prep: fix subprocess calls for master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339018 (owner: 10Thcipriani) [20:45:45] (03PS2) 10Dzahn: install: remove carbon from puppet [puppet] - 10https://gerrit.wikimedia.org/r/339021 (https://phabricator.wikimedia.org/T158020) [20:47:10] (03Merged) 10jenkins-bot: scap prep: fix subprocess calls for master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339018 (owner: 10Thcipriani) [20:47:26] (03CR) 10jenkins-bot: scap prep: fix subprocess calls for master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339018 (owner: 10Thcipriani) [20:47:40] !log (terbium) sql --write testwiki 'DELETE FROM module_deps' (per T158105) [20:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:44] T158105: "PHP Warning: filemtime(): No such file or directory" about files removed over a year ago - https://phabricator.wikimedia.org/T158105 [20:48:22] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044539 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1044.eqiad.wmnet'] ``` and were **ALL** successful. [20:48:37] !log (terbium) sql --write test2wiki 'DELETE FROM module_deps' (3687 rows affected, 0.01 sec) - per T158105. 
[20:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:11] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3044542 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1035.eqiad.wmnet'] ``` and were **ALL** successful. [20:49:53] (03CR) 10Dzahn: [C: 032] install: remove carbon from puppet [puppet] - 10https://gerrit.wikimedia.org/r/339021 (https://phabricator.wikimedia.org/T158020) (owner: 10Dzahn) [20:50:31] (03PS2) 10Dzahn: install: set install1002 as primary install again [puppet] - 10https://gerrit.wikimedia.org/r/339023 (https://phabricator.wikimedia.org/T158020) [20:51:41] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:52:16] 06Operations, 13Patch-For-Review: decom carbon - https://phabricator.wikimedia.org/T158020#3044558 (10Dzahn) [20:59:57] (03CR) 10Dzahn: [C: 032] install: set install1002 as primary install again [puppet] - 10https://gerrit.wikimedia.org/r/339023 (https://phabricator.wikimedia.org/T158020) (owner: 10Dzahn) [21:03:08] (03CR) 10Dzahn: "i don't think it's a no-op for the redirects. before you get a redirect to $1 so each tool, after you redirect everything to the overview " [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt) [21:06:19] (03CR) 10Tim Landscheidt: "JFTR: Didn't look further into my claim of a failure on Trusty; it works where it should work, and if it does not somewhere that would mak" [puppet] - 10https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk) [21:08:06] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:10:04] !log demon@tin Synchronized scap/plugins/prep.py: Completeness (duration: 00m 42s) [21:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:09] thcipriani: You're live ^ [21:10:43] RainbowSprinkles: whoa boy. Thanks :) [21:13:29] (03CR) 10Tim Landscheidt: "(I meant "no-op" = "no change to the previous behaviour", but I think we both do :-).)" [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt) [21:14:01] (03PS1) 10Chad: clean.py: Remove useless underscore from method name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339029 [21:20:49] (03CR) 10Mobrovac: [C: 031] Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [21:21:04] (03PS3) 10Chad: Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) [21:21:06] (03PS1) 10Chad: clean.py: Rework command execution, reduce code dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032 [21:21:23] (03PS2) 10Papaul: Add mgmt and production DNS for ms-be2028-ms-be2039 [dns] - 10https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) [21:22:22] (03PS2) 10Chad: clean.py: Rework command execution, reduce code dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032 [21:22:24] (03PS4) 10Chad: Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) [21:22:45] 06Operations, 10ops-codfw: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3044646 (10Papaul) [21:24:32] (03CR) 10Mobrovac: [C: 031] "PCC looks good as expected - https://puppet-compiler.wmflabs.org/5519/" [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) 
[21:26:43] (03CR) 10Volans: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [21:30:39] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=321.70 Read Requests/Sec=1563.30 Write Requests/Sec=6.10 KBytes Read/Sec=31028.80 KBytes_Written/Sec=130.40 [21:30:40] 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3044667 (10Pchelolo) So, we've discussed this on the team meeting and decided to move forward on this. The final question is whether to use syslog-over-udp or normal file logging? We... [21:31:26] !log carbon - puppet node clean, node deactivate (T158020) [21:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:33] T158020: decom carbon - https://phabricator.wikimedia.org/T158020 [21:33:44] (03PS1) 10Chad: clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 [21:35:57] urandom: I was waiting on the spare disks that did arrive. I will swap it out in the morning [21:36:19] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [21:36:23] 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3044690 (10GWicke) @Pchelolo, logging directly to a file is synchronous, which is bad for performance & can cause outages. See the earlier discussion for an example of such an outage.... [21:36:43] cmjohnson1: awesome; thanks! [21:39:45] 06Operations, 10ops-eqiad, 06Services (watching): Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3044700 (10Eevans) To summarize from IRC today: ```lang=irc 15:08 < urandom> cmjohnson1: is there any ETA on https://phabricator.wikimedia.org/T157425? ... 16:35 < cmjohnson1> uran... 
[21:40:22] (03PS1) 10Gergő Tisza: Fix PageViewInfo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339041 (https://phabricator.wikimedia.org/T158698) [21:43:13] (03PS3) 10Gergő Tisza: Fix SiteConfiguration array merge syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336747 (https://phabricator.wikimedia.org/T157656) [21:43:39] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=116.70 Read Requests/Sec=194.00 Write Requests/Sec=6.30 KBytes Read/Sec=1991.60 KBytes_Written/Sec=297.20 [21:45:03] 06Operations, 13Patch-For-Review: decom carbon - https://phabricator.wikimedia.org/T158020#3044742 (10Dzahn) [21:49:20] (03CR) 10BryanDavis: [C: 04-1] "A couple of small nits inline." (032 comments) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix) [21:50:57] (03PS4) 10Zppix: Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 [21:51:11] 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3044773 (10Pchelolo) >>! In T112648#3044690, @GWicke wrote: > @Pchelolo, logging directly to a file is synchronous, which is bad for performance & can cause outages. See the earlier d... [21:51:16] (03PS5) 10Zppix: Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 [21:53:39] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:54:07] (03PS13) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [21:55:03] (03CR) 10Rush: [V: 032 C: 032] openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 (owner: 10Rush) [21:56:21] (03CR) 10Zppix: [C: 031] Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix) [21:56:52] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3044783 (10Eevans) [21:57:55] (03PS8) 10Gergő Tisza: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (owner: 10Anomie) [21:58:28] (03CR) 10Dzahn: [C: 031] Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix) [21:58:40] 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3044786 (10GWicke) @Pchelolo, in the disk-full situation, writing directly to files would still cause memory to build up & the service to run out of memory. 
[21:59:48] (03PS1) 10Dzahn: remove carbon's production IPs [dns] - 10https://gerrit.wikimedia.org/r/339063 (https://phabricator.wikimedia.org/T158020) [22:01:22] !log carbon - removed from icinga, shutdown -h now (T158020) [22:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:28] T158020: decom carbon - https://phabricator.wikimedia.org/T158020 [22:01:41] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [22:03:09] (03CR) 10Dzahn: [C: 032] remove carbon's production IPs [dns] - 10https://gerrit.wikimedia.org/r/339063 (https://phabricator.wikimedia.org/T158020) (owner: 10Dzahn) [22:04:25] Hi, im seeing a restricted panel at https://phabricator.wikimedia.org [22:04:26] Missing or Restricted Panel [22:04:27] This panel does not exist, or you do not have permission to see it. [22:04:31] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:04:33] What is the panel? [22:04:41] twentyafterfour ^^ [22:04:42] paladox: what dashboard you use [22:04:45] Cause mine is fine [22:04:49] Zppix im using the default [22:04:54] i haven't changed mine. [22:05:00] Try using my dashboard it works fine [22:05:13] Oh, nope i like the default one :) [22:05:14] ? 
[22:05:24] I will take a screenshot [22:05:27] paladox: i meant to test your permission [22:06:50] Zppix twentyafterfour https://phabricator.wikimedia.org/F5747915 [22:06:57] it is near to the bottom [22:06:59] 06Operations, 13Patch-For-Review: decom carbon - https://phabricator.wikimedia.org/T158020#3044796 (10Dzahn) [22:07:09] 06Operations: decom carbon - https://phabricator.wikimedia.org/T158020#3023767 (10Dzahn) [22:07:15] Under Activity Feed [22:07:23] Have you tried relogginf [22:07:29] Relogging* [22:08:24] (03PS9) 10Gergő Tisza: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: 10Anomie) [22:08:29] Zppix that wont help as one of the admins changed the dashbored. [22:09:07] (03CR) 10Gergő Tisza: [C: 031] Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: 10Anomie) [22:09:25] I cant see it im only a user [22:09:39] So maybe twentyafterfour could [22:09:45] Zppix if your using a different dashbored to me, you wont be able to see it. [22:10:07] I switched to default to look at it [22:10:11] Same issue [22:10:11] 06Operations: decom carbon - https://phabricator.wikimedia.org/T158020#3044807 (10Dzahn) a:05Dzahn>03RobH Hi @Robh see the check boxes above. could you disable the switch port and then hand over? Thanks! [22:10:19] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3044810 (10Eevans) > Another problem that we have in repository management is the problem that a component can only contain one version of a binary package. That's problematic for long-term migrations, e.g. wh... 
[22:11:58] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3044819 (10Dzahn) [22:12:10] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2076659 (10Dzahn) carbon is down: count: 5 [22:12:22] I can't remove it [22:12:26] it's not a real panel [22:12:28] it's a glitch [22:13:27] Well thats good [22:13:36] Is it a cache issue [22:13:45] (03PS1) 10Rush: nova: run fullstack test suite on current labnet [puppet] - 10https://gerrit.wikimedia.org/r/339064 [22:14:32] (03PS2) 10Rush: nova: run fullstack test suite on current labnet [puppet] - 10https://gerrit.wikimedia.org/r/339064 [22:14:57] (03CR) 10BryanDavis: [C: 032] Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix) [22:15:35] twentyafterfour oh [22:15:40] thanks for fixing it [22:16:09] (03Merged) 10jenkins-bot: Update the realname from github repo url --> WikiTech [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix) [22:17:08] bd808: thanks now the github url will no longer haunt me [22:17:36] (03CR) 10Thcipriani: [C: 04-1] "I have questions about cleaning l10nupdate files in the staging directories." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 (owner: 10Chad) [22:17:39] that's a pretty bizarre thing to be haunted by ;) [22:17:42] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3044841 (10Dzahn) [22:18:02] * bd808 is scared of nuclear winter and fast moving zombies [22:18:09] bd808: well i hate github with a passion hence is why ive merged every tool ive made to gerrit [22:18:26] (03PS4) 10Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/326892 [22:18:32] huh. 
I actually really like github's product [22:18:42] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2088494 (10Dzahn) [22:18:59] Zppix how can you hate github? gerrit and github are good including differential. [22:19:05] differential and diffusion [22:19:05] I like the idea but the execution is crap... anyway thats not why were here [22:19:20] paladox: just gerrit has friendly ui in my opinion [22:19:42] Zppix oh. Well it's about to get freindler to mobile users. [22:19:50] !bash < Zppix> paladox: just gerrit has friendly ui in my opinion [22:19:50] bd808: Stored quip at https://tools.wmflabs.org/bash/quip/AVpiwmEwQMK9DA-FKimk [22:20:11] Lol why was that bashed lol [22:20:18] https://en.wikipedia.org/wiki/Stockholm_syndrome [22:20:30] bd808 what about polygerrit? That is way more freindler on desktop screens and mobiles. [22:20:50] the day I care about code review on mobile... [22:20:52] paladox: what i want is gerrit app for wmf [22:21:19] lol /me is afraid of gerrit [22:21:43] twentyafterfour: then dont join #wikimedia-releng [22:21:44] Zppix theres already a gerrit app for android but not wmf branded. [22:21:47] code review on mobile seems pretty weird to me [22:21:49] https://play.google.com/store/apps/details?id=com.ruesga.rview&hl=en_GB [22:22:00] paladox: sorry unless ios is the new android :/ [22:22:08] Zppix i never use android [22:22:12] * paladox hates android [22:22:15] lol [22:22:17] I love my iphone [22:22:26] so many opinions, so little time... [22:22:59] Zppix me two + the other one i had. [22:23:18] https://panic.com/prompt/ <-- use that and run gerrit via git cli [22:23:24] twentyafterfour: atleast its not always 4:20 [22:24:04] No i like to avoid shell as much as possible [22:24:05] it's always 4:20 somewhere at least once an hour... 
[22:24:23] I cant tell you how many tabs i use just so i can use shell [22:24:25] twentyafterfour lol, but when you in the car you could write something like shutdown or misspell [22:24:50] paladox: meh eqiad is not an important datacenter anyway :P [22:25:14] (03PS1) 10BryanDavis: Python3 compat [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339065 [22:25:16] (03PS1) 10BryanDavis: Use IB3 library [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066 [22:25:18] (03PS1) 10BryanDavis: Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067 [22:25:22] sudo + touch screen keyboard = disaster :D [22:26:25] twentyafterfour i payed £20 for https://itunes.apple.com/gb/app/alwaysonpc-firefox-flash-player/id324618793?mt=8 [22:26:29] Zppix lol yes it is [22:26:37] wikipedia will crash if eqiad goes down. [22:26:40] Wait so rm -f * in the puppetmaster for eqiad isnt what you wanted twentyafterfour [22:26:46] paladox: ik i was joking [22:27:11] lol [22:27:14] paladox: you got ripped off, it's £8.99 now [22:27:25] Yeh i know, and i doint even use the thing [22:27:30] any more [22:27:31] I just use safari [22:27:37] i only get 2gb of storage. [22:27:39] * Zppix puts on sunglasses [22:27:40] paladox: we can survive eqiad going away but it won't be instant recovery [22:27:45] oh [22:28:03] Doesnt the texas datacenter have the backup of eqiad? [22:28:22] Zppix: yes, everything is duplicated in codfw [22:28:26] twentyafterfour it will probaly be £20 again in 2 years any ways. 
[22:28:33] but I think it would take an hour or two to recover [22:28:39] yeh [22:29:15] 06Operations, 10ops-eqiad: decom carbon - https://phabricator.wikimedia.org/T158020#3044859 (10Dzahn) [22:29:34] (03CR) 10jerkins-bot: [V: 04-1] nova: run fullstack test suite on current labnet [puppet] - 10https://gerrit.wikimedia.org/r/339064 (owner: 10Rush) [22:30:04] MaxSem and Pchelolo: Respected human, time to deploy Kartotherian update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170221T2230). Please do the needful. [22:30:12] twentyafterfour: you would probably need more than this operations team to recover it [22:31:07] (03PS1) 10Tim Landscheidt: ganglia: Remove now-duplicate parser function suffix() [puppet] - 10https://gerrit.wikimedia.org/r/339069 [22:31:32] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [22:31:57] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3044883 (10Dzahn) [22:32:46] (03CR) 10Hashar: [C: 032] Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067 (owner: 10BryanDavis) [22:32:47] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#2094545 (10Dzahn) [22:32:55] (03CR) 10Tim Landscheidt: "Diff: diff -u <(git show production:modules/ganglia/lib/puppet/parser/functions/suffix.rb) modules/stdlib/lib/puppet/parser/functions/suff" [puppet] - 10https://gerrit.wikimedia.org/r/339069 (owner: 10Tim Landscheidt) [22:33:38] (03CR) 10jerkins-bot: [V: 04-1] Python3 compat [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339065 (owner: 10BryanDavis) [22:33:49] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#2095303 (10Dzahn) [22:33:54] (03CR) 10jerkins-bot: [V: 04-1] Use 
IB3 library [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066 (owner: 10BryanDavis) [22:34:07] (03CR) 10jerkins-bot: [V: 04-1] Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067 (owner: 10BryanDavis) [22:34:25] twentyafterfour i see now that polymer is in gwt. I do not get it why upstream will not do that. Polygerrit is all js, whereas gwt was implemented as java. [22:35:07] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3044890 (10Dzahn) @Zppix This ticket is for precise in production (i adjusted ticket title to clarify). For precise in labs please use T143349. [22:36:01] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3044896 (10Dzahn) [22:37:02] (03CR) 10Tim Landscheidt: "(In fact some of the usages of $kafka_config['brokers']['array'] might now be replaceable by suffix($kafka_config['brokers'], '') because " [puppet] - 10https://gerrit.wikimedia.org/r/339069 (owner: 10Tim Landscheidt) [22:37:10] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3044897 (10Zppix) @Dzahn Ah once i added the subtask for here i then started second guessing that, thanks for confirming my doubt. 
[22:37:30] (03PS2) 10BryanDavis: Python3 compat [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339065 [22:37:32] (03PS2) 10BryanDavis: Use IB3 library [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066 [22:37:34] (03PS2) 10BryanDavis: Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067 [22:38:09] (03PS5) 10Zppix: Tools: Make tools-clush-generator project-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/326892 (owner: 10Tim Landscheidt) [22:39:19] (03CR) 10Hashar: "Hints about py3 support from pywikibot/core :)" (032 comments) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339065 (owner: 10BryanDavis) [22:39:26] (03Abandoned) 10BryanDavis: Ignore lighttpd-precise in service.manifest [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/335569 (https://phabricator.wikimedia.org/T94792) (owner: 10BryanDavis) [22:39:52] bd808: hi! various python version is a bit of a mess. Luckily pywikibot people figured it out ! [22:40:00] bd808: some tips and tricks at https://gerrit.wikimedia.org/r/#/c/339065/1/tox.ini :} [22:40:28] bd808: you can also use just "py3" and tox will run whatever version "python3" is [22:41:51] (03PS2) 10Tim Landscheidt: Tools: Outfactor jobkill script to toollabs::node::all [puppet] - 10https://gerrit.wikimedia.org/r/335755 [22:42:03] (03CR) 10Hashar: [C: 031] "Looks good. See my note about flake8 being run with python2. Might want to add another env running flake8 with python3. 
That is what pyw" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339065 (owner: 10BryanDavis) [22:43:18] (03PS3) 10Dzahn: Add mgmt and production DNS for ms-be2028-ms-be2039 [dns] - 10https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) (owner: 10Papaul) [22:43:48] (03CR) 10Dzahn: [V: 031 C: 031] Add mgmt and production DNS for ms-be2028-ms-be2039 [dns] - 10https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) (owner: 10Papaul) [22:43:51] (03CR) 10Dzahn: [V: 031 C: 032] Add mgmt and production DNS for ms-be2028-ms-be2039 [dns] - 10https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) (owner: 10Papaul) [22:43:57] (03CR) 10Chad: clean.py: Fix up l10nupdate-owned files on masters (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 (owner: 10Chad) [22:47:13] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044907 (10Zppix) With the information @CRoslof provided I'm going to consider this task denied? Anyone disagree? [22:48:28] 06Operations, 10ops-eqiad: decom carbon - https://phabricator.wikimedia.org/T158020#3044908 (10RobH) [22:48:44] (03PS1) 10BryanDavis: Run flake8 on both python2 and python3 [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339076 [22:49:13] (03CR) 10BryanDavis: [C: 032] "Flake8 on py3 followup in Ie8abe9934b2afe59333a238b1d01a38f118d6e93" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339065 (owner: 10BryanDavis) [22:49:58] 06Operations, 10ops-eqiad: decom carbon - https://phabricator.wikimedia.org/T158020#3023767 (10RobH) a:05RobH>03Dzahn Assigning back to Daniel pending his approval to wipe the disks. (I imagine this approval will follow once we have a waiting period and no one realizes they missed anything.) When this is... 
[22:50:33] (03Merged) 10jenkins-bot: Python3 compat [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339065 (owner: 10BryanDavis) [22:50:52] jouncebot: next [22:50:52] In 1 hour(s) and 9 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T0000) [22:52:21] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:53:53] 06Operations, 10ops-eqiad: decom carbon - https://phabricator.wikimedia.org/T158020#3044912 (10RobH) a:05Dzahn>03Cmjohnson So this is now ready for wipe. After discussion with Daniel, we have multiple backup options of this data, but we'll put a last chance date of March 1st. Chris: Please do not wipe th... [22:54:10] 06Operations, 10ops-eqiad, 10hardware-requests: decom carbon - https://phabricator.wikimedia.org/T158020#3044915 (10RobH) [22:55:05] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044918 (10Matthewrbowker) >>! In T158490#3042854, @Zppix wrote: >>>! In T158490#3039567, @Matthewrbowker wrote: >>>>! In T158490#3039326, @Z... [22:55:13] (03CR) 10Hashar: [C: 032] "I can tell tell you are smarter than me :}" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339076 (owner: 10BryanDavis) [22:56:04] bd808: the IB3 library change, it is too late to properly review one. My understanding is you created a new lib that extract the useful bits from jouncebot . 
That is great [22:56:33] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044919 (10MaxSem) 05Open>03declined [22:56:42] hashar: I'm going to test it live (old school) as soon as the venv update finishes [22:56:54] I switched stashbot to it earlier today with no issues [22:57:40] (03CR) 10BryanDavis: "Testing via cherry-pick to tool" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066 (owner: 10BryanDavis) [22:58:32] paladox: bout that panel issue maybe filling a ticket could help [22:58:42] Zppix it's fixed now [22:58:43] (03CR) 10Thcipriani: [C: 031] clean.py: Remove useless underscore from method name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339029 (owner: 10Chad) [22:58:49] twentyafterfour fixed it :) [22:58:55] Oh what was the issue paladox [22:59:39] [22:12:23] I can't remove it [22:59:40] [22:12:27] it's not a real panel [22:59:40] [22:12:28] it's a glitch [22:59:44] Zppix ^^ [22:59:54] Ok but what caused it :D [23:00:00] bd808: yeah that is all super smart. Kudos! [23:00:07] Zppix not sure. [23:00:10] (03CR) 10Hashar: [C: 031] Use IB3 library [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066 (owner: 10BryanDavis) [23:00:36] bd808: don't wait for my reviews :} I am sleeping now! [23:00:39] (03CR) 10Zppix: [C: 031] clean.py: Remove useless underscore from method name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339029 (owner: 10Chad) [23:00:50] (03CR) 10Thcipriani: [C: 031] "nitpick inline" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032 (owner: 10Chad) [23:02:06] (03CR) 10Thcipriani: "comment inline" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) (owner: 10Chad) [23:04:01] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:04:12] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [23:06:28] andrewbogott: ^? [23:07:10] (03CR) 10Zppix: [C: 031] Run flake8 on both python2 and python3 [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339076 (owner: 10BryanDavis) [23:07:15] 06Operations, 10ops-codfw, 10netops: codfw:ms-be2028-ms-be2039 switch port configuration - https://phabricator.wikimedia.org/T158714#3044966 (10Papaul) [23:07:17] chasemp: no idea, I'll look [23:09:25] (03CR) 10Thcipriani: [C: 04-1] clean.py: Fix up l10nupdate-owned files on masters (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 (owner: 10Chad) [23:11:44] (03CR) 10Dzahn: "you are correct and i also like the consistency better, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt) [23:11:55] (03CR) 10Dzahn: [C: 032] toolserver_legacy: Use Redirect instead of RedirectMatch [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt) [23:12:04] (03PS2) 10Dzahn: toolserver_legacy: Use Redirect instead of RedirectMatch [puppet] - 10https://gerrit.wikimedia.org/r/338610 (owner: 10Tim Landscheidt) [23:14:09] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044987 (10Dzahn) Having multiple URLs for the same content is also bad for "SEO" and we already have w.wiki as a generic URL shortener. 
[23:15:11] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [23:15:57] (03CR) 10Chad: [C: 032] clean.py: Remove useless underscore from method name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339029 (owner: 10Chad) [23:17:33] (03Merged) 10jenkins-bot: clean.py: Remove useless underscore from method name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339029 (owner: 10Chad) [23:17:42] (03CR) 10jenkins-bot: clean.py: Remove useless underscore from method name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339029 (owner: 10Chad) [23:19:29] (03PS3) 10Chad: clean.py: Rework command execution, reduce code dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032 [23:19:39] thcipriani: Renamed do_stuff() to execute_remote() [23:19:40] :p [23:19:51] But do stuff is better RainbowSprinkles [23:20:08] do_stuff() [23:20:12] really_do_stuff() [23:20:16] do_stuff_2() [23:20:17] :) [23:20:29] Insert_the_code_automatically() [23:20:51] ^ thats a lifesaver [23:21:22] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [23:21:34] !log demon@tin Started scap: scap/plugins/clean.py Code cleanup [23:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:45] !log demon@tin scap aborted: scap/plugins/clean.py Code cleanup (duration: 00m 10s) [23:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:56] RainbowSprinkles: thanks :) [23:22:00] Whoops didn't mean a full scap [23:22:20] It would help :P [23:22:57] !log demon@tin Synchronized scap/plugins/clean.py: Code cleanup (duration: 00m 46s) [23:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:17] (03PS3) 10Gergő Tisza: Send 'exception' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 
(https://phabricator.wikimedia.org/T136849) [23:30:06] !log Kartotherian deploy did not happen [23:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:01] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [23:34:57] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3045037 (10GWicke) [23:43:15] jouncebot: next [23:43:15] In 0 hour(s) and 16 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T0000) [23:43:19] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3045070 (10GWicke) >>! In T66214#2981357, @Gilles wrote: > Something that's missing in the current plan, however, is the swift sharding information tha... [23:43:39] that was a "fun" bug in jouncebot :/ [23:43:44] Good job bd808 talk about last min [23:44:21] its running from my laptop at the moment. I'll get things in gerrit after SWAT is done [23:44:30] Oh gh [23:44:37] Is it that bad :P [23:49:07] (03PS3) 10Rush: nova: run fullstack test suite on current labnet [puppet] - 10https://gerrit.wikimedia.org/r/339064 [23:50:39] jouncebot: next [23:50:39] In 0 hour(s) and 9 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T0000) [23:54:37] 06Operations, 10ops-codfw: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3045079 (10RobH) [23:54:39] 06Operations, 10ops-codfw, 10netops: codfw:ms-be2028-ms-be2039 switch port configuration - https://phabricator.wikimedia.org/T158714#3045077 (10RobH) 05Open>03Resolved all ports have been enabled, had descriptions set, and placed in the private vlan for their respective rows. 
[23:54:51] 06Operations, 10ops-codfw: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3033850 (10RobH) [23:54:54] (03PS2) 10Andrew Bogott: WIP: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 [23:55:00] Warning: Cannot modify header information - headers already sent in /srv/mediawiki/php-1.29.0-wmf.12/includes/GlobalFunctions.php on line 1791 1 [23:55:00] Warning: Cannot modify header information - headers already sent in /srv/mediawiki/php-1.29.0-wmf.12/includes/libs/HttpStatus.php on line 111 [23:55:08] thcipriani: I should move wfGetCaller() one level further up [23:55:38] Also, recording the header() we're sending is probably a good idea [23:56:24] (03CR) 1020after4: [C: 031] clean.py: Rework command execution, reduce code dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032 (owner: 10Chad) [23:56:31] haven't dug into the logs on that one, not seeing anything good? [23:56:45] Nothing useful yet [23:56:54] The "already sent in" bit is just a wrapper for the caller itself [23:57:07] So should get the next highest caller [23:58:02] Eh, those errors aren't the same cuz they're not using WebResponse::header() [23:58:08] This is going to be cat and mouse :( [23:59:00] yeah, gonna be tricky to track down :\ [23:59:03] RainbowSprinkles: Tim has a pending patch to do that, if you can wait with the debugging until next week [23:59:11] or want to merge & backport [23:59:20] Link? We can backport easy [23:59:50] https://gerrit.wikimedia.org/r/#/c/338705/ [23:59:57] (03CR) 10Dzahn: [V: 031 C: 031] "http://puppet-compiler.wmflabs.org/5520/" [puppet] - 10https://gerrit.wikimedia.org/r/338302 (owner: 1020after4)
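Note on the wfGetCaller() exchange above (23:55:08 to 23:57:07): the point seems to be that a warning raised inside a wrapper names the wrapper itself unless the lookup goes one stack frame further up. A minimal Python sketch of that idea follows. It is not MediaWiki's PHP implementation; get_caller, send_header_shallow, send_header_deep and render_page are hypothetical names used only for illustration.

```python
import inspect


def get_caller(level=2):
    """Return the name of the function `level` frames above this one.

    level=0 is get_caller itself, level=1 is the function that called
    get_caller(), level=2 is that function's caller, and so on.
    """
    stack = inspect.stack()
    if level < len(stack):
        return stack[level].function
    return "<unknown>"


def send_header_shallow(headers_sent):
    # Hypothetical wrapper: level 1 always resolves to this wrapper,
    # so the message never says who really triggered the warning.
    if headers_sent:
        return "headers already sent in " + get_caller(1)
    return "ok"


def send_header_deep(headers_sent):
    # Same wrapper, but it looks one frame further up the stack and
    # therefore names the code that called the wrapper.
    if headers_sent:
        return "headers already sent in " + get_caller(2)
    return "ok"


def render_page():
    # Hypothetical application code that triggers both variants.
    return send_header_shallow(True), send_header_deep(True)
```

Calling render_page() returns one message naming the wrapper and one naming render_page, which is the difference between reporting the caller at the current level and "one level further up".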