[02:07:45] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [02:07:45] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [02:07:45] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [02:07:45] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [02:07:45] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [02:07:45] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [02:07:45] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [02:08:35] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [02:08:35] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [02:08:35] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [02:08:35] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [02:08:35] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [02:08:35] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [02:08:35] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [02:20:59] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 18s) [02:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:26:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 22 02:26:59 UTC 2017 (duration 6m 0s) [02:27:07] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:08:05] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:08:45] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:09:05] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:09:05] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [03:09:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:09:35] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [03:09:55] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [03:09:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [03:26:45] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:26:45] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:26:45] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:26:45] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:26:45] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:26:45] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:27:45] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Zotero 
alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200): /api (open graph via native scraper) timed out before a response was received [03:29:35] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [03:29:35] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [03:29:35] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [03:29:35] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [03:29:35] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [03:29:35] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [03:29:36] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [04:09:46] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4149.40 Read Requests/Sec=2780.70 Write Requests/Sec=552.20 KBytes Read/Sec=18886.80 KBytes_Written/Sec=12506.80 [04:17:45] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=6.80 Read Requests/Sec=0.30 Write Requests/Sec=0.50 KBytes Read/Sec=2.00 KBytes_Written/Sec=9.20 [05:12:55] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:45] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 79375 bytes in 0.323 second response time [05:36:45] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3283370 (10Papaul) [06:00:29] !log smalyshev@tin Started deploy [wdqs/wdqs@e4301da]: Redeploy GUI due to breakage in T165228 [06:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:39] T165228: Query results are downloaded in wrong encoding - https://phabricator.wikimedia.org/T165228 [06:02:19] !log smalyshev@tin Finished deploy [wdqs/wdqs@e4301da]: Redeploy GUI due to breakage in T165228 (duration: 01m 50s) 
[06:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:06] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3283373 (10Marostegui) 05Open>03Resolved All good! Thank you! ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK) ``` [06:21:34] (03CR) 10Marostegui: [C: 031] "This looks good: https://puppet-compiler.wmflabs.org/6491/" [puppet] - 10https://gerrit.wikimedia.org/r/354960 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [06:22:23] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3281974 (10Marostegui) Hi! Thanks for the patch to realm.pp - it looks good (I have commented on the gerrit patch). From our side I believe we only need to merg... [06:24:53] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3283380 (10Dereckson) Thanks for the quick review. Yes, you can merge the change, I've a deployment access, not ops access, so I can't merge it. 
[06:33:06] (03PS1) 10Marostegui: db-eqiad.php: Depool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355064 (https://phabricator.wikimedia.org/T162611) [06:34:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355064 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:36:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355064 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:36:20] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355064 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:37:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1036 - T162611 (duration: 00m 39s) [06:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:38] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [06:42:58] (03PS1) 10Marostegui: db-codfw.php: Depool db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355065 (https://phabricator.wikimedia.org/T162611) [06:45:08] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355065 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:46:27] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355065 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:46:33] (03CR) 10jenkins-bot: db-codfw.php: Depool db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355065 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:47:22] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2035 - T162611 (duration: 00m 38s) [06:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:32] T162611: Unify 
revision table on s2 - https://phabricator.wikimedia.org/T162611 [06:47:35] !log Deploy alter table on db2035 and db1036 for s2. bgwiktionary,eowiki, idwiki - T162611 [06:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:42] !log Deploy alter table s7.frwiktionary on db2029 (codfw master) - https://phabricator.wikimedia.org/T165743 [06:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:45] !log Deploy alter table s7.frwiktionary on dbstore1001 - https://phabricator.wikimedia.org/T165743 [06:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:24] !log Run CleanDuplicateScores script to clean up possible duplicates on wikidatawiki before starting to create the UNIQUE keys - T164530 [07:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:32] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530 [07:14:35] !log Deploy alter table on s5 wikidatawiki.ores_classification directly on codfw master - T164530 [07:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:44] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530 [07:22:03] !log installing openjdk-7 security updates on jessie [07:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:15] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] [07:33:42] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3283446 (10Florian) [07:35:24] (03CR) 10Florianschmidtwelzow: [C: 04-1] Apache: add techconduct.wm.o to remnant sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354959 
(https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [07:37:15] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [07:46:42] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2035" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355068 [07:46:55] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355069 [07:50:01] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355069 [07:51:54] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355069 (owner: 10Marostegui) [07:52:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355069 (owner: 10Marostegui) [07:53:08] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2035" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355068 (owner: 10Marostegui) [07:53:11] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2035" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355068 [07:53:13] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355069 (owner: 10Marostegui) [07:54:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1036 - T162611 (duration: 00m 39s) [07:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:11] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [07:56:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db2035 - T162611 (duration: 00m 38s) [07:56:14] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2035" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355068 (owner: 10Marostegui) [07:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:57] 
!log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2035 - T162611 (duration: 00m 38s) [07:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:24] (03PS1) 10Marostegui: db-eqiad.php: Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355071 (https://phabricator.wikimedia.org/T162611) [07:59:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355071 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [08:01:05] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355071 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [08:01:18] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355071 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [08:02:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1021 - T162611 (duration: 00m 38s) [08:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:12] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [08:02:19] !log Deploy alter table on s2 (revision table) db1021 - https://phabricator.wikimedia.org/T162611 [08:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:52] (03CR) 10DCausse: [C: 031] logstash: move 'hostname' to 'host' for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/353853 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [08:17:57] (03CR) 10DCausse: [C: 031] logstash: build http_request from webrequest fields [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [08:32:45] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Create CoC committee private wiki - 
https://phabricator.wikimedia.org/T165977#3283487 (10jcrespo) Marostegui, you can deploy and restart if you want, Dereckson cannot do that. [08:41:06] (03PS1) 10Marostegui: db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355072 (https://phabricator.wikimedia.org/T164530) [08:42:01] (03CR) 10Marostegui: [C: 032] Don't replicate techconductwiki to labs [puppet] - 10https://gerrit.wikimedia.org/r/354960 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [08:42:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355072 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui) [08:43:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355072 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui) [08:46:19] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355072 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui) [08:46:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1026 - T164530 (duration: 00m 38s) [08:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:55] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530 [08:47:23] 06Operations, 07HHVM, 07Upstream: HHVM: Crash in server worker - https://phabricator.wikimedia.org/T165669#3283531 (10MoritzMuehlenhoff) [08:49:42] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3210271 (10faidon) stretch now has 0.4.1 (prepared/maintained by yours truly) and I just checked, doesn't suffer from this bug. The right way here would be for us to switch to that, or if we have local patches, r... 
[08:50:17] !log Restart mysql on db1095 to apply new replication filters - T165977 [08:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:26] T165977: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977 [08:54:30] (03PS4) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/302882 [08:54:36] (03CR) 10jerkins-bot: [V: 04-1] Add netlink-based Ipvsmanager implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/302882 (owner: 10Giuseppe Lavagetto) [08:55:38] (03CR) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation (031 comment) [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354509 (owner: 10Giuseppe Lavagetto) [08:55:45] !log Restart mysql on db1069 to apply new replication filters - T165977 [08:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:53] T165977: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977 [08:57:58] (03Abandoned) 10ArielGlenn: Support MediaWiki version 1.23 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/113103 (https://phabricator.wikimedia.org/T68663) (owner: 10Wpmirrordev) [08:58:31] (03Abandoned) 10ArielGlenn: Extend maximum allowed mediawiki version to 1.23 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/113124 (owner: 10Wpmirrordev) [08:58:39] (03Abandoned) 10ArielGlenn: Extend maximum allowed mediawiki version to 1.24 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/139413 (owner: 10Wpmirrordev) [09:02:47] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3283554 (10Marostegui) Both sanitarium hosts have been restarted and they have the new wiki replication filters like: ``` techconductwiki.% ``` db1095 (which... 
[09:03:48] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3283557 (10Marostegui) a:05jcrespo>03None [09:06:00] (03PS1) 10Marostegui: db-eqiad.php: Repool db1026, depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355073 (https://phabricator.wikimedia.org/T164530) [09:07:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1026, depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355073 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui) [09:08:44] 06Operations: Sync internal nutcracker package with Debian package - https://phabricator.wikimedia.org/T166038#3283563 (10MoritzMuehlenhoff) [09:09:00] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1026, depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355073 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui) [09:09:09] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1026, depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355073 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui) [09:09:16] 06Operations: Sync internal nutcracker package with Debian package - https://phabricator.wikimedia.org/T166038#3283578 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:09:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1026, depool db1045 - T164530 (duration: 00m 39s) [09:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:03] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530 [09:14:11] (03PS1) 10ArielGlenn: script to generate pagesperchunkhistory config setting for a given wiki [dumps] - 10https://gerrit.wikimedia.org/r/355075 [09:14:12] (03PS1) 10ArielGlenn: make dumps using extension scripts work without MWScript stuff [dumps] - 10https://gerrit.wikimedia.org/r/355076 [09:14:14] (03PS1) 10ArielGlenn: 
split up flow dumps into stubs and content passes [dumps] - 10https://gerrit.wikimedia.org/r/355077 (https://phabricator.wikimedia.org/T164262) [09:14:38] (03CR) 10jerkins-bot: [V: 04-1] split up flow dumps into stubs and content passes [dumps] - 10https://gerrit.wikimedia.org/r/355077 (https://phabricator.wikimedia.org/T164262) (owner: 10ArielGlenn) [09:14:54] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3283592 (10MoritzMuehlenhoff) I've filed T166038 for rebasing to the stretch 0.4.1 package. I'll proceed with rolling out the current isolated fix in the mean time; with the current behaviour nutcracker is trippi... [09:15:33] !log Drop table MediaWikiInstallPingback_15732959 from db1046, db1047 and dbstore1002 - T165836 [09:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:42] T165836: Drop table MediaWikiInstallPingback_15732959 from EventLogging DB - https://phabricator.wikimedia.org/T165836 [09:22:33] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 13Patch-For-Review, 06Release-Engineering-Team (Backlog): Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3283618 (10jcrespo) I also use colors, check my bashrc if you want them: {F8158232} [09:30:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This is very useful, but I have some doubts that need to be addressed." (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/354939 (owner: 10Filippo Giunchedi) [09:32:25] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3283633 (10jcrespo) Let's also schedule a custom check of check_private_data when it gets added to the private wiki list. [09:34:05] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3283634 (10Marostegui) >>! 
In T165977#3283633, @jcrespo wrote: > Let's also schedule a custom check of check_private_data when it gets added to the private wiki... [09:35:25] (03Abandoned) 10Giuseppe Lavagetto: Add generic Finite States Machine [debs/pybal] - 10https://gerrit.wikimedia.org/r/302435 (owner: 10Giuseppe Lavagetto) [09:37:39] !log Deploy alter table s7.frwiktionary on db1039 - https://phabricator.wikimedia.org/T165743 [09:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:53] <_joe_> damn gerrit [09:38:13] (03PS1) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 [09:38:52] (03CR) 10Giuseppe Lavagetto: "Due to a gerrit bug, I had to resubmit this change as https://gerrit.wikimedia.org/r/355082" [debs/pybal] - 10https://gerrit.wikimedia.org/r/302435 (owner: 10Giuseppe Lavagetto) [09:40:15] (03Abandoned) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/302882 (owner: 10Giuseppe Lavagetto) [09:40:28] (03Abandoned) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354509 (owner: 10Giuseppe Lavagetto) [09:40:48] (03Abandoned) 10Giuseppe Lavagetto: Add IPVSError as a generic IPVS-related exception [debs/pybal] - 10https://gerrit.wikimedia.org/r/313556 (owner: 10Giuseppe Lavagetto) [09:41:25] (03Abandoned) 10Giuseppe Lavagetto: profile::etcd::replication: add --strip option [puppet] - 10https://gerrit.wikimedia.org/r/341805 (owner: 10Giuseppe Lavagetto) [09:42:17] (03PS2) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: allow read-only mode [puppet] - 10https://gerrit.wikimedia.org/r/353231 (https://phabricator.wikimedia.org/T159687) [09:49:57] (03PS2) 10ArielGlenn: split up flow dumps into stubs and content passes [dumps] - 10https://gerrit.wikimedia.org/r/355077 (https://phabricator.wikimedia.org/T164262) [09:56:11] (03CR) 
10Muehlenhoff: [C: 04-1] "There's a number of host entries in hieradata/hosts left" [puppet] - 10https://gerrit.wikimedia.org/r/354453 (https://phabricator.wikimedia.org/T164341) (owner: 10Elukey) [09:59:20] !log repooled mw2221 (was down for hardware error) [09:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:04] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3102305 (10NickK) This happened again today, this time targeting checkuser-l and another user (will not disclose username here but one thing in common is that this user also uses m... [10:05:11] (03CR) 10Mark Bergsma: Add netlink-based Ipvsmanager implementation (031 comment) [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354509 (owner: 10Giuseppe Lavagetto) [10:10:56] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::tlsproxy: allow read-only mode [puppet] - 10https://gerrit.wikimedia.org/r/353231 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [10:16:45] RECOVERY - mediawiki-installation DSH group on mw2221 is OK: OK [10:34:01] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: better read-only error reporting [puppet] - 10https://gerrit.wikimedia.org/r/355092 [10:34:03] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: write to localhost via http [puppet] - 10https://gerrit.wikimedia.org/r/355093 [10:47:28] (03PS7) 10Mark Bergsma: Adapt NaiveBGPPeering to support UPDATE message overflow [debs/pybal] - 10https://gerrit.wikimedia.org/r/354686 [10:47:30] (03PS7) 10Mark Bergsma: Allow for withdrawals and NLRI to be sent in the same UPDATE [debs/pybal] - 10https://gerrit.wikimedia.org/r/354723 [10:47:32] (03PS4) 10Mark Bergsma: Add GPLv2 license header to bgp.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/354955 [10:50:10] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::tlsproxy: better read-only error reporting [puppet] - 
10https://gerrit.wikimedia.org/r/355092 (owner: 10Giuseppe Lavagetto) [10:50:53] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::replication: write to localhost via http [puppet] - 10https://gerrit.wikimedia.org/r/355093 (owner: 10Giuseppe Lavagetto) [10:57:58] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: fixup nginx configuration for read-only [puppet] - 10https://gerrit.wikimedia.org/r/355094 [10:59:04] (03PS2) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: fixup nginx configuration for read-only [puppet] - 10https://gerrit.wikimedia.org/r/355094 [11:00:20] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::tlsproxy: fixup nginx configuration for read-only [puppet] - 10https://gerrit.wikimedia.org/r/355094 (owner: 10Giuseppe Lavagetto) [11:02:48] (03PS1) 10Giuseppe Lavagetto: profile::nginx::tlsproxy: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/355096 [11:04:45] (03PS2) 10Giuseppe Lavagetto: profile::nginx::tlsproxy: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/355096 [11:06:40] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::nginx::tlsproxy: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/355096 (owner: 10Giuseppe Lavagetto) [11:10:39] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3283820 (10tomasz) @Dereckson: I've uploaded an example logo at https://commons.wikimedia.org/wiki/File:Code_of_Conduct_Committee_logo.svg - can you have a look? 
[11:18:25] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:18:25] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:18:45] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:19:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:19:15] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:20:15] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:21:05] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [11:21:15] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:21:45] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [11:21:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [11:22:32] (03PS1) 10ArielGlenn: treat wikidata just like enwiki for dumps [puppet] - 10https://gerrit.wikimedia.org/r/355100 [11:23:45] (03CR) 10jerkins-bot: [V: 04-1] treat wikidata just like enwiki for dumps [puppet] - 10https://gerrit.wikimedia.org/r/355100 (owner: 10ArielGlenn) [11:38:46] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:39:06] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:39:07] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:39:25] PROBLEM - citoid endpoints health on scb1002 is 
CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:39:25] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [11:40:15] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:40:25] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:40:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [11:41:45] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [11:42:05] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [11:50:13] (03PS1) 10Hoo man: WikibaseClient: Don't persist Statement usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355101 (https://phabricator.wikimedia.org/T151717) [12:10:54] 06Operations, 10Ops-Access-Requests: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3281090 (10ArielGlenn) We need manager approval for this please. [12:12:43] (03PS2) 10ArielGlenn: treat wikidata just like enwiki for dumps [puppet] - 10https://gerrit.wikimedia.org/r/355100 [12:16:02] 06Operations, 10Ops-Access-Requests: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3283964 (10Jan_Dittrich) I report to @Abraham – Abraham, for the technical wishlist, I need to analyze search queries, and for that the approval of the person I report to is needed (cc @Lea_... 
[12:31:42] (03PS2) 10Jforrester: Enable TimedMediaHandler's new video player Beta Feature in Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354389 (https://phabricator.wikimedia.org/T148103) [12:38:27] (03PS3) 10Jforrester: Use wikitech db group instead of labswiki+ labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354856 (owner: 10BryanDavis) [12:40:02] (03PS1) 10Rush: maintain-dbusers: cleanup one-time legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/355103 [12:40:54] (03CR) 10jerkins-bot: [V: 04-1] maintain-dbusers: cleanup one-time legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/355103 (owner: 10Rush) [12:43:11] (03PS2) 10Jforrester: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354731 [12:49:57] (03PS2) 10Alexandros Kosiaris: Set Type=notify for etcd systemd units [puppet] - 10https://gerrit.wikimedia.org/r/354095 [12:50:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Set Type=notify for etcd systemd units [puppet] - 10https://gerrit.wikimedia.org/r/354095 (owner: 10Alexandros Kosiaris) [12:57:17] Who's doing the SWAT, BTW? Is anyone actually around? :-) [12:59:45] PROBLEM - etcdmirror-conftool-codfw-wmnet service on conf1002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-codfw-wmnet is failed [12:59:45] PROBLEM - Check systemd state on conf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170522T1300). [13:00:04] James_F, Dereckson, and dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:10] PROBLEM - Etcd replication lag on conf1002 is CRITICAL: connect to address 10.64.32.180 and port 8000: Connection refused [13:00:20] * James_F waves. [13:00:27] <_joe_> uhm [13:00:29] <_joe_> looking [13:01:19] I'm around, but if my flight does in fact depart at its delayed time, I won't have internet for long enough to finish the SWAT [13:01:57] <_joe_> akosiaris: did you restart etcd with your change? [13:02:01] <_joe_> I guess so [13:02:11] RECOVERY - Etcd replication lag on conf1002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time [13:02:17] <_joe_> that sent an incomplete response back to etcdmirror which barfed [13:02:17] and there's the recovery [13:02:17] ic [13:02:28] <_joe_> yeah [13:02:45] RECOVERY - etcdmirror-conftool-codfw-wmnet service on conf1002 is OK: OK - etcdmirror-conftool-codfw-wmnet is active [13:02:45] RECOVERY - Check systemd state on conf1002 is OK: OK - running: The system is fully operational [13:02:48] <_joe_> !log restarted etcdmirror on conf1002, consequence of https://gerrit.wikimedia.org/r/354095 [13:02:51] yeah I guess that's expected [13:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:02] <_joe_> akosiaris: actually that's partly my fault [13:03:14] <_joe_> the tlsproxy should handle 5xx errors [13:03:28] (03PS1) 10Rush: WIP: maintain-dbusers discussion strawman for doc system [puppet] - 10https://gerrit.wikimedia.org/r/355105 [13:03:39] <_joe_> as in, send back a properly formatted response [13:03:44] <_joe_> I'll fix that too [13:04:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: maintain-dbusers discussion strawman for doc system [puppet] - 10https://gerrit.wikimedia.org/r/355105 (owner: 10Rush) [13:04:45] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [13:05:45] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/python /usr/bin/nova-compute [13:07:16] RoanKattouw: It's all config stuff, so… [13:07:48] (03PS3) 10Alexandros Kosiaris: nrpe: Ship a systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/354183 [13:08:22] James_F: if the patches are easy (no maint script and co) I can probably swat is no one else is around [13:08:48] That'd be awesome. Yeah, all simple stuff from my end at least. [13:09:01] ok so, I can SWAT [13:10:00] Cool. [13:10:23] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354389 (https://phabricator.wikimedia.org/T148103) (owner: 10Jforrester) [13:11:47] (03Merged) 10jenkins-bot: Enable TimedMediaHandler's new video player Beta Feature in Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354389 (https://phabricator.wikimedia.org/T148103) (owner: 10Jforrester) [13:12:00] (03CR) 10jenkins-bot: Enable TimedMediaHandler's new video player Beta Feature in Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354389 (https://phabricator.wikimedia.org/T148103) (owner: 10Jforrester) [13:14:34] James_F: if this is something you can test on mwdebug1002 it's live there [13:14:41] Looking. [13:15:20] dcausse: Looks good. Works in beta, no effect in production, as intended. 
[13:15:37] ok [13:17:12] !log dcausse@tin Synchronized wmf-config/CommonSettings.php: Enable TimedMediaHandler's new video player Beta Feature in Labs (duration: 00m 43s) [13:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:43] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354856 (owner: 10BryanDavis) [13:18:16] (03CR) 10Alexandros Kosiaris: [C: 032] "Passed strict=>false as well to avoid shipping an upstart script, merging" [puppet] - 10https://gerrit.wikimedia.org/r/354183 (owner: 10Alexandros Kosiaris) [13:18:21] (03PS4) 10Alexandros Kosiaris: nrpe: Ship a systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/354183 [13:18:24] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] nrpe: Ship a systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/354183 (owner: 10Alexandros Kosiaris) [13:18:30] (03PS1) 10Cmjohnson: Adding mgmt dns for new parsoid wtp125-1048 T165520 [dns] - 10https://gerrit.wikimedia.org/r/355106 [13:18:48] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for new parsoid wtp125-1048 T165520 [dns] - 10https://gerrit.wikimedia.org/r/355106 (owner: 10Cmjohnson) [13:18:55] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3284085 (10Cmjohnson) [13:19:08] (03Merged) 10jenkins-bot: Use wikitech db group instead of labswiki+ labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354856 (owner: 10BryanDavis) [13:19:20] (03CR) 10jenkins-bot: Use wikitech db group instead of labswiki+ labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354856 (owner: 10BryanDavis) [13:19:44] (03PS2) 10Cmjohnson: Adding mgmt dns for new parsoid wtp125-1048 T165520 [dns] - 10https://gerrit.wikimedia.org/r/355106 [13:21:15] PROBLEM - Disk space on elastic1049 is CRITICAL: Return code of 255 is out of bounds [13:21:25] PROBLEM - Disk space on restbase1010 is CRITICAL: Return code of 255 is out of 
bounds [13:21:26] PROBLEM - DPKG on wtp1019 is CRITICAL: Return code of 255 is out of bounds [13:21:26] PROBLEM - salt-minion processes on elastic1049 is CRITICAL: Return code of 255 is out of bounds [13:21:26] PROBLEM - configured eth on elastic1049 is CRITICAL: Return code of 255 is out of bounds [13:21:26] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: Return code of 255 is out of bounds [13:21:26] PROBLEM - salt-minion processes on restbase1010 is CRITICAL: Return code of 255 is out of bounds [13:21:26] PROBLEM - DPKG on restbase1010 is CRITICAL: Return code of 255 is out of bounds [13:21:27] PROBLEM - Check whether ferm is active by checking the default input chain on wtp1019 is CRITICAL: Return code of 255 is out of bounds [13:21:35] PROBLEM - Check whether ferm is active by checking the default input chain on wdqs1003 is CRITICAL: Return code of 255 is out of bounds [13:21:35] PROBLEM - salt-minion processes on wtp1019 is CRITICAL: Return code of 255 is out of bounds [13:21:35] PROBLEM - dhclient process on wtp1019 is CRITICAL: Return code of 255 is out of bounds [13:21:35] PROBLEM - dhclient process on wtp1017 is CRITICAL: Return code of 255 is out of bounds [13:21:35] PROBLEM - dhclient process on mw1253 is CRITICAL: Return code of 255 is out of bounds [13:21:35] PROBLEM - nutcracker process on mw1253 is CRITICAL: Return code of 255 is out of bounds [13:21:36] PROBLEM - Check systemd state on mw1253 is CRITICAL: Return code of 255 is out of bounds [13:21:37] PROBLEM - Disk space on wtp1019 is CRITICAL: Return code of 255 is out of bounds [13:21:37] ? [13:21:54] * akosiaris looking [13:22:09] thats the first time ive seen icinga-wm get excess flood'd [13:22:12] oh damn [13:22:14] same issue again? [13:22:23] no, that probably my change [13:22:24] wasn't 255 the same issue than we had with one of the dbstores [13:22:29] oh, ok [13:22:44] is basically everything alerting now? 
[13:22:46] !log silence icinga [13:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:16] akosiaris: want me to create a revert patch? [13:23:26] Zppix: no, not yet [13:23:33] trying to figure out what's going on [13:23:41] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: Use wikitech db group instead of labswiki+ labtestwiki (duration: 00m 39s) [13:23:44] ack let me know i have the change pulled up akosiaris [13:23:55] all these should not be paging .. what on earth ? [13:24:18] (03PS4) 10DCausse: Remove special Math extension settings for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353970 (owner: 10Amire80) [13:24:20] 7 sms so far [13:24:29] I 've shut icinga down [13:24:30] <_joe_> akosiaris: should I stop puppet across the fleet? [13:24:36] <_joe_> ouch [13:24:43] well it was paging like crazy [13:24:49] I am gonna disable notifications and start it again [13:25:29] <_joe_> akosiaris: what's happening exactly [13:25:35] !log start icinga again with disable notifications [13:25:42] so we at least have monitoring [13:25:48] _joe_: I am looking now [13:26:11] 1131 critical services unhandled [13:26:21] ok the nrpe systemd unit was crap [13:26:23] * akosiaris fixing [13:26:27] akosiaris: safe to continue swating? [13:26:34] dcausse: yes [13:26:34] <_joe_> akosiaris: should we revert? [13:26:35] ok [13:26:44] no, I 'll fix it.. it's just nrpes [13:26:53] <_joe_> akosiaris: I'd sayy it's /not/ safe to swat [13:27:02] why ? 
[13:27:27] (03PS1) 10Zppix: Revert "nrpe: Ship a systemd unit file" [puppet] - 10https://gerrit.wikimedia.org/r/355108 [13:27:39] <_joe_> because we won't notice if some alarm fires off [13:27:40] swat with icinga shut down isn't the best idea [13:27:46] I am with joe [13:27:50] it's not shutdown [13:27:52] it's running [13:28:10] <_joe_> akosiaris: fix the unit file, we'll discuss afterwards :) [13:28:13] ok [13:28:27] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353970 (owner: 10Amire80) [13:28:42] <_joe_> dcausse: please wait before syncing [13:28:45] ok [13:29:47] (03Merged) 10jenkins-bot: Remove special Math extension settings for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353970 (owner: 10Amire80) [13:29:59] (03CR) 10jenkins-bot: Remove special Math extension settings for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353970 (owner: 10Amire80) [13:31:01] ok found the problem, uploading change [13:31:26] (03Abandoned) 10Zppix: Revert "nrpe: Ship a systemd unit file" [puppet] - 10https://gerrit.wikimedia.org/r/355108 (owner: 10Zppix) [13:32:31] (03PS1) 10Alexandros Kosiaris: nrpe: Set type=forking and pass -d in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/355109 [13:32:43] (03CR) 10Alexandros Kosiaris: [C: 032] nrpe: Set type=forking and pass -d in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/355109 (owner: 10Alexandros Kosiaris) [13:32:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] nrpe: Set type=forking and pass -d in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/355109 (owner: 10Alexandros Kosiaris) [13:33:00] thanks akosiaris for fixing so quickly :) [13:33:22] Zppix: it was my mess after all :-) [13:33:37] akosiaris: true and for that *trouts akosiaris * xD [13:34:56] I am gonna wait a while (~20 mins) before enabling notifications again [13:35:27] <_joe_> are you forcing a puppet run across the fleet or waiting it out? 
[13:39:06] (03PS3) 10DCausse: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354731 (owner: 10Jforrester) [13:39:36] (03PS1) 10Muehlenhoff: Use gdb from jessie-backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/355110 [13:40:08] anyone else notice that stashbot is gone for a while now and hasnt rejoined? [13:59:26] dcausse: Still going? [13:59:51] James_F: still a problem with icinga, holding the swat for now [14:00:09] I have https://gerrit.wikimedia.org/r/#/c/353970/ merged but not synced [14:00:56] James_F: want me to revert and rescheule them at a later time? [14:07:24] James_F: I'm going to revert the unsynced one to have tin up to date, will add a note to the deployment page [14:07:48] OK. :-( [14:08:25] James_F: yes, sorry :(, but I don't want to leave tin in a bad state [14:08:48] (03PS1) 10DCausse: Revert "Remove special Math extension settings for hewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355112 [14:09:48] * James_F nods. 
[14:15:39] (03CR) 10DCausse: [C: 032] "SWAT (revert unsynced change)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355112 (owner: 10DCausse) [14:17:12] (03Merged) 10jenkins-bot: Revert "Remove special Math extension settings for hewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355112 (owner: 10DCausse) [14:17:20] (03CR) 10jenkins-bot: Revert "Remove special Math extension settings for hewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355112 (owner: 10DCausse) [14:18:51] (03PS1) 10DCausse: Revert "Revert "Remove special Math extension settings for hewiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355114 [14:20:58] (03PS2) 10DCausse: Remove special Math extension settings for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355114 [14:39:29] (03CR) 10Thiemo Mättig (WMDE): [C: 031] WikibaseClient: Don't persist Statement usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355101 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [14:51:26] (03CR) 10Ema: [C: 031] Adapt NaiveBGPPeering to support UPDATE message overflow [debs/pybal] - 10https://gerrit.wikimedia.org/r/354686 (owner: 10Mark Bergsma) [14:51:26] <_joe_> !log restarting apache2 on puppetmaster2001 [14:52:54] (03PS8) 10Mark Bergsma: Allow for withdrawals and NLRI to be sent in the same UPDATE [debs/pybal] - 10https://gerrit.wikimedia.org/r/354723 [14:52:56] (03PS5) 10Mark Bergsma: Add GPLv2 license header to bgp.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/354955 [14:53:54] (03CR) 10Mark Bergsma: [C: 032] Adapt NaiveBGPPeering to support UPDATE message overflow [debs/pybal] - 10https://gerrit.wikimedia.org/r/354686 (owner: 10Mark Bergsma) [14:54:23] (03CR) 10Ema: [C: 031] Allow for withdrawals and NLRI to be sent in the same UPDATE [debs/pybal] - 10https://gerrit.wikimedia.org/r/354723 (owner: 10Mark Bergsma) [14:54:39] (03Merged) 10jenkins-bot: Adapt NaiveBGPPeering to support UPDATE message overflow [debs/pybal] - 
10https://gerrit.wikimedia.org/r/354686 (owner: 10Mark Bergsma) [14:54:52] (03CR) 10Mark Bergsma: [C: 032] Allow for withdrawals and NLRI to be sent in the same UPDATE [debs/pybal] - 10https://gerrit.wikimedia.org/r/354723 (owner: 10Mark Bergsma) [14:55:11] let's try reverting my config patch and see if that helps [14:55:22] (03Merged) 10jenkins-bot: Allow for withdrawals and NLRI to be sent in the same UPDATE [debs/pybal] - 10https://gerrit.wikimedia.org/r/354723 (owner: 10Mark Bergsma) [14:55:39] (03Merged) 10jenkins-bot: Add GPLv2 license header to bgp.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/354955 (owner: 10Mark Bergsma) [14:55:52] (03PS1) 10BryanDavis: Revert "Use wikitech db group instead of labswiki+ labtestwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355116 [14:57:04] Hi, it seems nagios keeps timming out when ever i try to start nagios-nrpe-server. [14:57:08] I'm going to merge and scap that config revert to see if it fixes the write master for wikitech [14:57:18] It started today as it has been working for a few months now. 
[14:57:22] (03CR) 10BryanDavis: [C: 032] Revert "Use wikitech db group instead of labswiki+ labtestwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355116 (owner: 10BryanDavis) [14:58:26] akosiaris ^^ [14:58:36] (03Merged) 10jenkins-bot: Revert "Use wikitech db group instead of labswiki+ labtestwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355116 (owner: 10BryanDavis) [15:00:34] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 39s) [15:00:58] !log last scap was for Revert "Use wikitech db group instead of labswiki+ labtestwiki" [15:01:45] !log Wikitech writes working again [15:03:42] (03CR) 10jenkins-bot: Revert "Use wikitech db group instead of labswiki+ labtestwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355116 (owner: 10BryanDavis) [15:05:18] (03CR) 10BryanDavis: "This change somehow caused wikitech's write db config to point to db1034 instead of silver." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354856 (owner: 10BryanDavis) [15:05:31] 06Operations, 10ops-eqiad: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529#3284301 (10Cmjohnson) New bbu and raid card has been ordered and on it's way Service Request – 948416019 Service Tag – 9BJKV12 Dispatch # 326249113 [15:05:52] I don't understand [15:05:59] the default host [15:06:02] is s3 [15:06:12] jynus: I don't understand either [15:06:12] why a random server? [15:06:23] I understand something like a missconfiguration [15:06:30] but why those 2 hosts [15:06:49] db1034 and db1094 [15:07:15] let's have a look at the log error trace [15:08:32] jynus: so somehow the dbname ended up being "metawiki" [15:08:44] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.05.22/mediawiki?id=AVwwowsmqfND4HJ7OUQN&_g=() [15:09:00] :| [15:09:07] How's that possible? 
[15:09:10] bd [15:09:16] then it could be an extension [15:09:20] so the write path on wikitech decided that the active wikidb was meta?! [15:09:27] that should uses meta [15:09:33] like a global filter or banner, etc. [15:09:43] that is not adecuately disabled [15:09:45] the two config flags that I changed in that patch were extensions. lest see which ones [15:10:09] my bet is on wmgUseGlobalAbuseFilters [15:10:20] yeah, that would fit [15:10:30] however, I aslo saw it trying to connect to s6 [15:10:33] and I'd additionally bet that the way that reedy told me that dbgroup names worked there is a lie [15:10:39] which would not be explained [15:10:53] * bd808 always blames reedy [15:12:52] (03CR) 10BryanDavis: "> This change somehow caused wikitech's write db config to point to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354856 (owner: 10BryanDavis) [15:16:13] (03PS3) 10BryanDavis: Add Code of Conduct footer links to wikitech and mw.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354612 [15:17:44] bd808: labswiki is in medium, I assume that you can only overrides by the dbname not another dbgroup :/ [15:18:59] so in [ 'medium' => true, 'wikitech' => false], medium will win [15:20:08] dcausse: that's a reasonable sounding guess. TL;DR dbgroups are spooky and our config is complicated. 
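The precedence dcausse guesses at above (an override keyed by a second dbgroup loses to the group that matches first, while an override keyed by dbname wins) can be sketched as a toy lookup. This is only a model of that guess, not MediaWiki's actual resolution code: `resolve_setting`, the flag values, and the group order for labswiki are all assumptions made for illustration.

```python
def resolve_setting(values, dbname, groups):
    """Return the value one wiki gets for one setting (toy model).

    values: dict mapping a dbname, a group name, or "default" to a value.
    groups: the wiki's group memberships, in lookup order (assumed).
    """
    # An explicit per-wiki entry is checked before any group.
    if dbname in values:
        return values[dbname]
    # Otherwise the first group with an entry wins; later groups are ignored.
    for group in groups:
        if group in values:
            return values[group]
    return values.get("default")


labswiki_groups = ["medium", "wikitech"]  # hypothetical order

# Override keyed by group: "medium" matches first, so 'wikitech' => false is never read.
by_group = {"default": False, "medium": True, "wikitech": False}
print(resolve_setting(by_group, "labswiki", labswiki_groups))   # True

# Override keyed by dbname: the per-wiki entry takes effect.
by_dbname = {"default": False, "medium": True, "labswiki": False}
print(resolve_setting(by_dbname, "labswiki", labswiki_groups))  # False
```

Under this model, a flag meant to be off for wikitech would need a dbname key (or a single non-overlapping group) rather than a second group key.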
[15:22:26] very true :( [15:24:24] 06Operations, 10ops-eqiad: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3284327 (10Cmjohnson) [15:24:33] akosiaris i've figured out another fix for nagios nrpe systemd script you created [15:24:49] (03PS1) 10Alexandros Kosiaris: nrpe: Don't set PrivateTmp=True [puppet] - 10https://gerrit.wikimedia.org/r/355119 [15:25:39] (03PS2) 10Alexandros Kosiaris: nrpe: Don't set PrivateTmp=True [puppet] - 10https://gerrit.wikimedia.org/r/355119 [15:25:41] <_joe_> paladox: not now, please [15:25:51] <_joe_> we're in the middle of a couple fires [15:25:56] ok [15:26:25] (03PS3) 10Alexandros Kosiaris: nrpe: Don't set PrivateTmp=True [puppet] - 10https://gerrit.wikimedia.org/r/355119 (https://phabricator.wikimedia.org/T148507) [15:27:03] <_joe_> !log restarted puppetmasters in codfw [15:27:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] nrpe: Don't set PrivateTmp=True [puppet] - 10https://gerrit.wikimedia.org/r/355119 (https://phabricator.wikimedia.org/T148507) (owner: 10Alexandros Kosiaris) [15:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:18] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3284389 (10Jgreen) >>! In T137161#3277960, @BBlack wrote: > @Jgreen - re: civicrm, it needs to emit the HSTS header on **all** HTTPS responses.... 
[15:40:43] (03PS5) 10Ema: Move BGP classes to bgp.bgp, IP classes to bgp.ip [debs/pybal] - 10https://gerrit.wikimedia.org/r/354746 [15:43:38] (03CR) 10Mark Bergsma: [C: 032] Move BGP classes to bgp.bgp, IP classes to bgp.ip [debs/pybal] - 10https://gerrit.wikimedia.org/r/354746 (owner: 10Ema) [15:44:19] (03Merged) 10jenkins-bot: Move BGP classes to bgp.bgp, IP classes to bgp.ip [debs/pybal] - 10https://gerrit.wikimedia.org/r/354746 (owner: 10Ema) [15:44:48] (03PS2) 10Ema: bgp: add a few unit tests [debs/pybal] - 10https://gerrit.wikimedia.org/r/355000 [15:45:00] 06Operations, 07Puppet: Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066#3284421 (10Joe) [15:46:37] 06Operations, 10ops-eqiad: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529#3284424 (10elukey) Thanks a lot! I'd prefer if you could tell me something before booting the host, I'd like to reimage it straight away since it is running Trusty and the whole cluster is running Debian now. [15:47:39] (03CR) 10Ema: [V: 032 C: 032] bgp: add a few unit tests [debs/pybal] - 10https://gerrit.wikimedia.org/r/355000 (owner: 10Ema) [15:48:54] (03PS1) 10Mark Bergsma: Add BGPUpdateMessage attribute method test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355123 [15:49:15] ^ not done with that yet [15:52:11] (03CR) 10Muehlenhoff: [C: 031] "actually +1, local problem on my end" [puppet] - 10https://gerrit.wikimedia.org/r/354453 (https://phabricator.wikimedia.org/T164341) (owner: 10Elukey) [15:52:32] (03PS2) 10Mark Bergsma: Add BGPUpdateMessage attribute method test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355123 [15:58:55] (03PS3) 10ArielGlenn: treat wikidata just like enwiki for dumps [puppet] - 10https://gerrit.wikimedia.org/r/355100 [16:06:47] !log re-enable notifications in icinga [16:06:50] _joe_ Can i publish my patch i did, no one has to review it until you fixed what fire happended. Just i want to publish it. 
It's currently in draft so it wont send any notifications. [16:06:54] Please. [16:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:02] paladox: yeah sure, publish it [16:08:08] Thanks [16:08:34] (03Draft1) 10Paladox: nagios-nrpe-server: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/355122 [16:08:37] (03PS3) 10Paladox: nagios-nrpe-server: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/355122 [16:08:41] (03Draft3) 10Paladox: nagios-nrpe-server: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/355122 [16:10:11] Tested ^^ on labs and now all the services have started to work after cherry picking that on to the puppetmaster in my project. [16:11:55] (03Draft1) 10Paladox: Phabricator: Use mkdir -p for creating phd directory in systemd [puppet] - 10https://gerrit.wikimedia.org/r/355125 [16:11:57] (03PS2) 10Paladox: Phabricator: Use mkdir -p for creating phd directory in systemd [puppet] - 10https://gerrit.wikimedia.org/r/355125 [16:15:56] 06Operations, 10Traffic: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765#3276305 (10ema) p:05Triage>03Normal [16:16:54] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[nagios-nrpe-server] [16:21:54] PROBLEM - Confd template for /etc/dsh/group/parsoid on tegmen is CRITICAL: Return code of 255 is out of bounds [16:21:54] PROBLEM - Confd template for /etc/dsh/group/cassandra on tegmen is CRITICAL: Return code of 255 is out of bounds [16:21:54] PROBLEM - Check whether ferm is active by checking the default input chain on tegmen is CRITICAL: Return code of 255 is out of bounds [16:21:54] PROBLEM - MD RAID on tegmen is CRITICAL: Return code of 255 is out of bounds [16:21:54] PROBLEM - configured eth on tegmen is CRITICAL: Return code of 255 is out of bounds [16:21:59] (03CR) 10Krinkle: "Possible fixme or need for documentation" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354856 (owner: 10BryanDavis) [16:22:04] PROBLEM - Check systemd state on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:04] PROBLEM - dhclient process on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:04] PROBLEM - DPKG on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:04] PROBLEM - salt-minion processes on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:20] <_joe_> tegmen again [16:22:24] PROBLEM - Disk space on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:24] PROBLEM - tcpircbot_service_running on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:34] PROBLEM - Check size of conntrack table on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:34] PROBLEM - confd service on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:34] PROBLEM - Confd template for /etc/dsh/group/mediawiki-installation on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:35] PROBLEM - ircecho_service_running on tegmen is CRITICAL: Return code of 255 is out of bounds [16:22:36] <_joe_> akosiaris: nrpe keeps dying there [16:23:56] yeah.. 
looking [16:24:23] (03CR) 10Krinkle: "Basically if more than one group is involved and they overlap you have to use a computed group instead, so that it is reduced to only sett" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354856 (owner: 10BryanDavis) [16:24:29] 06Operations, 10ops-eqiad: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3284602 (10RobH) [16:26:54] RECOVERY - Confd template for /etc/dsh/group/parsoid on tegmen is OK: No errors detected [16:26:54] RECOVERY - Confd template for /etc/dsh/group/cassandra on tegmen is OK: No errors detected [16:26:54] RECOVERY - Check whether ferm is active by checking the default input chain on tegmen is OK: OK ferm input default policy is set [16:26:54] RECOVERY - MD RAID on tegmen is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [16:26:55] RECOVERY - configured eth on tegmen is OK: OK - interfaces up [16:27:04] RECOVERY - dhclient process on tegmen is OK: PROCS OK: 0 processes with command name dhclient [16:35:11] 06Operations, 10Ops-Access-Requests: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3284644 (10RobH) Please note that all NDA signatures must be confirmed by Legal. @RStallman-legalteam is the preferred point of contact for NDA confirmations. 
[16:36:32] (03CR) 10Ema: Instrumentation fixes (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/354680 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [16:39:20] (03Abandoned) 10Ema: bgp: log with util.log instead of printing to stdout [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/344659 (owner: 10Ema) [16:43:04] 06Operations, 10ops-eqiad, 15User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3284670 (10Joe) [16:43:24] RECOVERY - Disk space on tegmen is OK: DISK OK [16:43:24] RECOVERY - tcpircbot_service_running on tegmen is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [16:43:34] RECOVERY - Check size of conntrack table on tegmen is OK: OK: nf_conntrack is 9 % full [16:43:34] RECOVERY - confd service on tegmen is OK: OK - confd is active [16:43:34] RECOVERY - Confd template for /etc/dsh/group/mediawiki-installation on tegmen is OK: No errors detected [16:43:35] RECOVERY - ircecho_service_running on tegmen is OK: PROCS OK: 2 processes with args ircecho [16:43:47] 06Operations, 10Ops-Access-Requests: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3284674 (10RStallman-legalteam) Confirming that Jan Dittrich has a NDA on file for shell access. 
[16:44:04] RECOVERY - Check systemd state on tegmen is OK: OK - running: The system is fully operational [16:44:04] RECOVERY - DPKG on tegmen is OK: All packages OK [16:44:04] RECOVERY - salt-minion processes on tegmen is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:52:11] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#3284681 (10Dzahn) 05stalled>03Open [16:52:20] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (10Dzahn) p:05Normal>03High [16:52:27] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3284683 (10RobH) We've gotten a notice from Intel, FWD from Dasher, that they'll be shipping a replacement disk, and a return tag for the defec... [17:00:05] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170522T1700). Please do the needful. 
[17:01:04] (03PS1) 10Alexandros Kosiaris: nrpe: Remove user and group from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/355130 [17:01:24] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] nrpe: Remove user and group from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/355130 (owner: 10Alexandros Kosiaris) [17:04:05] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:04:35] PROBLEM - configured eth on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:04:35] PROBLEM - Check whether ferm is active by checking the default input chain on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:05] PROBLEM - dhclient process on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:05] PROBLEM - Disk space on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:15] PROBLEM - nutcracker port on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:15] PROBLEM - Check size of conntrack table on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:15] PROBLEM - puppet last run on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:25] PROBLEM - salt-minion processes on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:26] PROBLEM - Check systemd state on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:26] PROBLEM - nutcracker process on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:26] PROBLEM - DPKG on mw1167 is CRITICAL: Return code of 255 is out of bounds [17:05:28] (03CR) 10Alexandros Kosiaris: "Solved somewhat differently in https://gerrit.wikimedia.org/r/#/c/355130/" [puppet] - 10https://gerrit.wikimedia.org/r/355122 (owner: 10Paladox) [17:06:15] RECOVERY - nutcracker port on mw1167 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [17:06:15] RECOVERY - Check size of conntrack table on mw1167 is OK: OK: nf_conntrack is 70 % full [17:06:15] RECOVERY - puppet 
last run on mw1167 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:06:25] RECOVERY - salt-minion processes on mw1167 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:06:25] RECOVERY - Check systemd state on mw1167 is OK: OK - running: The system is fully operational [17:06:26] RECOVERY - nutcracker process on mw1167 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [17:06:26] RECOVERY - DPKG on mw1167 is OK: All packages OK [17:06:40] did someone do something? [17:06:53] jynus: me.. this => https://gerrit.wikimedia.org/r/#/c/355130/ [17:07:03] ok, thanks [17:07:14] fixed the final (hopefully) bug in the systemd unit script [17:07:16] I was worring a mass failure again :-) [17:07:22] I was not expecting it to be so bad [17:07:27] a damn systemd unit script [17:07:28] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3284697 (10Papaul) Thanks [17:09:52] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3284699 (10RobH) So it turns out Intel wants the disk sent back in advance. Can this disk detect enough for us to perform an wipe on it? Othe... [17:10:06] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3284700 (10RobH) a:05RobH>03Papaul [17:10:50] akosiaris hmm, shows Cannot write to pidfile '/var/run/nagios/nrpe.pid' - check your privileges. [17:10:56] and dosen't show green for running [17:11:35] paladox: it's running fine now in production [17:11:45] Oh, it was after a reboot [17:12:07] ah, ok [17:12:48] So it seems that restarting an instance will not start the service without manual intervention for me. 
[17:13:13] Hmm, it still hits timeout [17:13:16] and fails to start [17:13:45] Failed to restart restart.service: Unit restart.service failed to load: No such file or directory. [17:13:51] systemctl restart nagios-nrpe-server just hangs. [17:14:00] restart.service ? [17:14:03] what's that ? [17:14:09] typo ? [17:14:35] (PS4) Paladox: nagios-nrpe-server: Fix systemd script [puppet] - https://gerrit.wikimedia.org/r/355122 [17:14:44] fwiw, stop, start and restart work just fine in production [17:14:50] Oh. [17:14:56] Strange how it fails on labs. [17:19:28] akosiaris ah, fixed it by adding a safety check like ExecStartPre=-/bin/mkdir -p /var/run/nagios/ [17:19:46] Which we also do for other systemd scripts like phd for phabricator. [17:20:08] (PS5) Paladox: nagios-nrpe-server: Fix systemd script [puppet] - https://gerrit.wikimedia.org/r/355122 [17:20:43] (PS6) Paladox: nagios-nrpe-server: Fix systemd script [puppet] - https://gerrit.wikimedia.org/r/355122 [17:20:55] Operations, ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3284712 (RobH) It looks like this arrived last Friday: https://wwwapps.ups.com/WebTracking/processInputRequest?tracknums_displayed=5&TypeOfInquiryNumber=T&HTMLVersion=5.0&AgreeToTermsAndConditions=yes&Requester=UI... [17:21:27] (PS1) Alexandros Kosiaris: check_cpufreq: Issue a CRITICAL, not a WARNING [puppet] - https://gerrit.wikimedia.org/r/355132 [17:22:16] paladox: how did your VM end up without /var/run/nagios existing ? [17:22:28] I'm not sure. [17:22:48] I rebooted and it seemed that didn't do it without help from doing [17:22:49] ExecStartPre=-/bin/mkdir -p /var/run/nagios/ [17:23:35] (CR) Alexandros Kosiaris: [V: 2 C: 2] check_cpufreq: Issue a CRITICAL, not a WARNING [puppet] - https://gerrit.wikimedia.org/r/355132 (owner: Alexandros Kosiaris) [17:25:01] also i needed that because even if i didn't reboot it wouldn't create the directory.
I'm not sure why it worked in prod but fails in labs. [17:25:18] It happened with phd which is why we need this safeguard. [17:29:56] paladox: ok I'll have a closer look tomorrow. you may be on to something [17:30:08] Ok thanks :) [17:34:50] Operations, ops-codfw, DC-Ops, Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3284732 (RobH) I've put elastic2020 into maint mode in icinga for the next month, and have shut it down. @Papaul, you can boot the system no... [17:35:27] akosiaris i think this is a bug as systemd should ensure the directory is created before creating the pid. [17:41:37] Operations, DBA, Wikimedia-Site-requests, Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3284770 (Dereckson) Looks good to me. I've cc'd some other members on the task, so we'll perhaps get more feedback. [17:45:36] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat-kill] [17:48:36] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:50:46] PROBLEM - CPU frequency on acamar is CRITICAL: CRIT, CPU frequency is 600 MHz (187) [17:51:05] Operations, ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3284786 (Cmjohnson) Replaced the motherboard, plugged back into console connected to iron. Plugged in (which is the only means of powering on and off) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170522T1800). Please do the needful.
[18:01:13] Hello [18:01:46] AaronSchulz: ping [18:10:23] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3284802 (10RobH) a:05Cmjohnson>03faidon Ok, the old board has to be returned, but Wim didn't give us any return instructions yet. (He advised he was, but hasn't yet.) I'll ping him for a followup. Additionall... [18:11:01] re: acamar, is someone actively working on rebooting/fixing it? [18:11:21] not i [18:11:35] ok I'll take a look [18:11:38] i can if you want [18:11:45] i just parsed that may have been you asking cuz it died [18:11:54] i parsed it as 'i wanna do someting on it' which was likely incorrect [18:11:54] heh [18:12:11] ahh, cpu error =P [18:13:01] yeah the 180mhz thing [18:13:11] or whatever, ~164Mhz presently [18:13:30] apparently blacklisting acpi_pad isn't enough :P [18:14:25] well, shit. [18:15:40] anyways, I'm taking a look, I think I'll try manually blacklisting mei [18:18:14] !log rebooting acamar [18:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:06] PROBLEM - Host acamar is DOWN: PING CRITICAL - Packet loss = 100% [18:19:46] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:19:46] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka2002.codfw.wmnet because of too many down!: trendingedits_6699 - Could not depool server scb2004.codfw.wmnet because of too many down!: prometheus_80 - Could not depool server prometheus2004.codfw.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs2001.codfw.wmnet because of too many down!: [18:19:46] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:20:06] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka2003.codfw.wmnet because of too many down!: trendingedits_6699 - Could not depool server scb2002.codfw.wmnet because of too many down!: prometheus_80 - Could not depool server prometheus2004.codfw.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs2001.codfw.wmnet because of too many down!: [18:20:33] PROBLEM - LVS HTTP IPv4 on eventbus.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:00] uh oh [18:21:15] nice [18:21:21] just got paged :) [18:21:22] RECOVERY - LVS HTTP IPv4 on eventbus.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1417 bytes in 0.001 second response time [18:21:22] RECOVERY - Host acamar is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:21:25] <_joe_> what's up? [18:21:26] .... [18:21:30] uh? [18:21:31] <_joe_> oh, acamar I guess [18:21:34] damn is it tied to acamar or just coinicende [18:21:36] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2001 is OK: All endpoints are healthy [18:21:39] coincidence even. [18:21:40] ah [18:21:40] probably related to acamar, yes [18:21:43] yeah [18:21:46] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2003 is OK: All endpoints are healthy [18:21:49] although it shouldn't be :P [18:21:54] <_joe_> ok off again [18:21:56] RECOVERY - CPU frequency on acamar is OK: OK. CPU frequency is = 600 MHz (1232) [18:22:07] * apergos peeks back out [18:22:28] * elukey checks eventbus' logs.. 
[18:23:05] https://phabricator.wikimedia.org/T162818 [18:23:11] probably related to ^ [18:23:36] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:23:46] PROBLEM - Check systemd state on acamar is CRITICAL: Return code of 255 is out of bounds [18:23:56] PROBLEM - puppet last run on acamar is CRITICAL: Return code of 255 is out of bounds [18:24:06] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [18:24:36] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset -0.099446 secs [18:24:41] so on kafka2001 I can see [18:24:41] May 22 18:10:00 kafka2001 eventlogging-service-eventbus[1073]: (eventlogging-aa06ba9a-34c3-11e7-bcc5-141877396f37-kafka2001.codfw.wmnet.1073-network-thread) Node 2001 connection failed -- refreshing metadata [18:24:47] RECOVERY - Check systemd state on acamar is OK: OK - running: The system is fully operational [18:24:56] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [18:25:19] just checking that kafkat et all are o [18:25:19] ok [18:25:34] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1&var-cluster=main-codfw&var-kafka_brokers=All&var-kafka_servers=All&from=now-3h&to=now [18:26:17] and https://grafana.wikimedia.org/dashboard/db/eventbus?from=now-3h&to=now&refresh=1m&orgId=1&var-site=codfw&var-topic=All [18:28:54] so nagios-nrpe-service keeps failing on acamar [18:28:59] is this still part of some known issue? [18:29:57] (main-codfw kafka cluster looks ok, eventbus codfw seems ok too) [18:32:43] May 22 18:29:39 acamar nrpe[4352]: Cannot write to pidfile '/var/run/nagios/nrpe.pid' - check your privileges [18:32:50] someone quoted this error earlier [18:32:52] what's the fix? 
[18:33:13] (no idea sorry) [18:33:16] PROBLEM - configured eth on acamar is CRITICAL: Return code of 255 is out of bounds [18:33:16] PROBLEM - Check size of conntrack table on acamar is CRITICAL: Return code of 255 is out of bounds [18:33:33] akosiaris: ? [18:33:46] PROBLEM - Check whether ferm is active by checking the default input chain on acamar is CRITICAL: Return code of 255 is out of bounds [18:33:46] PROBLEM - dhclient process on acamar is CRITICAL: Return code of 255 is out of bounds [18:33:46] PROBLEM - salt-minion processes on acamar is CRITICAL: Return code of 255 is out of bounds [18:33:46] PROBLEM - Check systemd state on acamar is CRITICAL: Return code of 255 is out of bounds [18:33:56] PROBLEM - Disk space on acamar is CRITICAL: Return code of 255 is out of bounds [18:33:56] PROBLEM - DPKG on acamar is CRITICAL: Return code of 255 is out of bounds [18:33:56] PROBLEM - CPU frequency on acamar is CRITICAL: Return code of 255 is out of bounds [18:33:56] PROBLEM - MD RAID on acamar is CRITICAL: Return code of 255 is out of bounds [18:34:04] (those are just the failing nrpe checks) [18:34:17] I'm guessing some fix was salted but missed acamar while it was being slow [18:34:43] bblack: AFAIK it was fixed with https://github.com/wikimedia/puppet/commit/cab6d7101eaa93d37adcc271a4d69059d921747c [18:35:08] that fix already landed on acamar [18:35:27] and I saw pala.dox saying something about failing upon restart [18:35:32] sorry reboot [18:35:46] see above like 1h10m ago [18:36:12] ExecStartPre=-/bin/mkdir -p /var/run/nagios/ [18:36:16] RECOVERY - configured eth on acamar is OK: OK - interfaces up [18:36:17] RECOVERY - Check size of conntrack table on acamar is OK: OK: nf_conntrack is 0 % full [18:36:27] not sure what was the outcome though [18:36:46] RECOVERY - Check whether ferm is active by checking the default input chain on acamar is OK: OK ferm input default policy is set [18:36:46] RECOVERY - salt-minion processes on acamar is OK: PROCS OK: 1 process 
with regex args ^/usr/bin/python /usr/bin/salt-minion [18:36:46] RECOVERY - dhclient process on acamar is OK: PROCS OK: 0 processes with command name dhclient [18:36:47] RECOVERY - Check systemd state on acamar is OK: OK - running: The system is fully operational [18:36:56] RECOVERY - Disk space on acamar is OK: DISK OK [18:36:56] RECOVERY - DPKG on acamar is OK: All packages OK [18:36:56] RECOVERY - CPU frequency on acamar is OK: OK. CPU frequency is = 600 MHz (1291) [18:36:56] RECOVERY - MD RAID on acamar is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:38:21] how did all of those start working when nrpe is still dead? :P [18:38:34] lol [18:39:09] now it's fixed (nrpe) [18:39:27] what did you do? [18:39:29] also, this msg is confusing: CPU frequency on acamar is OK: OK. CPU frequency is = 600 MHz (1291) [18:39:37] yeah should be > [18:39:46] I applied (manually with puppet disabled) the unit file mkdir fixup [18:40:14] ok [18:40:26] modules/base/files/monitoring/check_cpufreq: echo "OK. CPU frequency is >= ${min_mhz} MHz ($cpu_freq)" [18:40:37] ^ somehow the greater-than sign gets lost in translation somewhere before IRC [18:41:26] IRC, on Icinga is ok [18:41:28] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=acamar&service=CPU+frequency [18:44:24] (03PS1) 10BBlack: nrpe-server: mkdir for pidfile in ExecStartPre [puppet] - 10https://gerrit.wikimedia.org/r/355147 [18:46:26] 06Operations, 10ops-eqiad, 15User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3284834 (10RobH) [18:47:13] 06Operations, 13Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3284851 (10BBlack) acamar hit this again on Sunday, in spite of the (working) `acpi_pad` blacklist. A simple reboot seems to have cleared it. The next- best advice (based on that old Dell info) would be to blacklist... 
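[editor's note] The nrpe fix discussed above is quoted in the log only as the single directive ExecStartPre=-/bin/mkdir -p /var/run/nagios/. A minimal sketch of how such a line might sit in a systemd drop-in follows. The drop-in path and file name are hypothetical, and the change actually merged in https://gerrit.wikimedia.org/r/355147 may be structured differently.

```ini
# Hypothetical drop-in location (an assumption, not taken from the log):
#   /etc/systemd/system/nagios-nrpe-server.service.d/piddir.conf
[Service]
# Runs before the main ExecStart command of the unit.
# The leading "-" makes systemd ignore a non-zero exit status of this command.
# "mkdir -p" creates missing parent directories and does not fail if the
# directory already exists.
ExecStartPre=-/bin/mkdir -p /var/run/nagios/
```

If /var/run is backed by a tmpfs on these hosts (an assumption; the log does not confirm the cause), the directory would disappear on every reboot, which would be consistent with nrpe failing to write its pidfile only after a reboot. systemd's RuntimeDirectory= setting is another way to have such a directory created at service start; nothing in the log says whether it was considered.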
[18:47:39] (03CR) 10BBlack: [C: 032] nrpe-server: mkdir for pidfile in ExecStartPre [puppet] - 10https://gerrit.wikimedia.org/r/355147 (owner: 10BBlack) [18:48:56] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 45 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[nagios-nrpe-server] [18:49:56] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [18:52:07] (03PS1) 10ArielGlenn: remove some unused dump command lists [puppet] - 10https://gerrit.wikimedia.org/r/355148 [18:54:36] (03CR) 10Daniel Kinzler: [C: 031] WikibaseClient: Don't persist Statement usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355101 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [18:56:29] !log demon@tin Pruned MediaWiki: 1.29.0-wmf.20 (duration: 01m 21s) [18:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:35] !log demon@tin Synchronized README: forcing co-master sync (duration: 00m 42s) [18:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:46] (03PS1) 10ArielGlenn: improve names of dump command lists [puppet] - 10https://gerrit.wikimedia.org/r/355149 [19:05:45] bblack that's the same problem i had, fixed it with https://gerrit.wikimedia.org/r/#/c/355122/ [19:11:13] paladox: oh sorry I didn't see your outstanding commit, I ended up merging a similar one though [19:11:21] Oh [19:11:23] https://gerrit.wikimedia.org/r/#/c/355147/ [19:11:45] Thanks [19:11:46] :) [19:13:13] (03Abandoned) 10Paladox: nagios-nrpe-server: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/355122 (owner: 10Paladox) [19:13:35] (03PS4) 10Chad: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 [19:14:58] (03CR) 10Chad: "PS4 contains revisions plus rebase on top of profile/role fixups. 
That being said: this feels kludgy and requires maintenance if we add/re" [puppet] - 10https://gerrit.wikimedia.org/r/344221 (owner: 10Chad) [19:16:19] !log BBR: cp1065: switching qdisc to mq+fq manually - T147569 [19:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:27] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569 [19:16:41] (03CR) 10Chad: "Ping?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [19:16:55] (03PS3) 10Mark Bergsma: Add BGPUpdateMessage test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355123 [19:18:02] (03PS1) 10ArielGlenn: cleanup the dump list commands template syntax [puppet] - 10https://gerrit.wikimedia.org/r/355151 [19:18:39] (03PS1) 10Mark Bergsma: Add GPLv2 header to bgp/ip.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/355152 [19:25:10] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3284945 (10Trijnstel) >>! In T160529#3283666, @NickK wrote: > This happened again today, this time targeting checkuser-l and another user (will not disclose username here but one t... 
[19:25:17] !log BBR: cp1065: switching congestion control to bbr manually - T147569 [19:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:27] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569 [19:29:15] !log BBR: cp1074: switching qdisc to mq+fq manually - T147569 [19:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:47] !log BBR: cp1074: switching congestion control to bbr manually - T147569 [19:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:56] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569 [19:33:26] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [19:33:26] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [19:34:06] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [19:34:16] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [19:34:16] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [19:36:06] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [19:36:06] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [19:36:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [19:36:16] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [19:36:17] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [19:39:04] (03CR) 10Muehlenhoff: [C: 032] gerrit 
(2.13.8+git1-wmf.1) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/354485 (https://phabricator.wikimedia.org/T158946) (owner: 10Chad) [19:58:07] (03PS1) 10Chad: gerrit (2.13.8+git1-wmf.2) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355155 [19:58:19] akosiaris: should profiles ever have "system::role" in them? should only real roles have system::role, so profiles lose them when being coverted? [19:58:41] or do we rename the thing that adds motd snippets [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170522T2000). [20:00:56] I have a deploy for ores [20:03:37] (03PS1) 10Dzahn: contint: role/profile conversion [puppet] - 10https://gerrit.wikimedia.org/r/355156 [20:07:09] (03CR) 10Dzahn: "questions: should the profiles really not use system::role anymore or should they? 
and: what exactly in labs uses the roles i am renamin" [puppet] - 10https://gerrit.wikimedia.org/r/355156 (owner: 10Dzahn) [20:11:36] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 61027 MB (12% inode=99%) [20:14:09] !log starting deploy of ores:68cca85 to prod [20:14:14] !log ladsgroup@tin Started deploy [ores/deploy@263255a]: (no justification provided) [20:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:47] it does it already, it has been a while since I deployed ores [20:14:49] (03PS2) 10Dzahn: contint: role/profile conversion [puppet] - 10https://gerrit.wikimedia.org/r/355156 [20:15:53] (03CR) 10Dzahn: [C: 031] Phabricator: Use mkdir -p for creating phd directory in systemd [puppet] - 10https://gerrit.wikimedia.org/r/355125 (owner: 10Paladox) [20:17:25] (03CR) 10Dzahn: [C: 032] gerrit (2.13.8+git1-wmf.2) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355155 (owner: 10Chad) [20:18:28] (03PS3) 10Dzahn: Phabricator: Use mkdir -p for creating phd directory in systemd [puppet] - 10https://gerrit.wikimedia.org/r/355125 (owner: 10Paladox) [20:20:12] (03CR) 10Dzahn: [C: 031] "re: "Use -p to make sure the folder exists before trying to create it" it's "ignore if it already exists" instead of "make sure that it ex" [puppet] - 10https://gerrit.wikimedia.org/r/355125 (owner: 10Paladox) [20:21:51] canary looks okay, going for all [20:23:54] (03PS4) 10Dzahn: Phabricator: Use mkdir -p for creating phd directory in systemd [puppet] - 10https://gerrit.wikimedia.org/r/355125 (owner: 10Paladox) [20:24:16] (03CR) 10Dzahn: [C: 032] Phabricator: Use mkdir -p for creating phd directory in systemd [puppet] - 10https://gerrit.wikimedia.org/r/355125 (owner: 10Paladox) [20:24:45] Thanks ^^ [20:32:45] !log arlolra@tin Started deploy [parsoid/deploy@a9f2229]: Updating Parsoid to ebac1890 
[20:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:39] !log arlolra@tin Finished deploy [parsoid/deploy@a9f2229]: Updating Parsoid to ebac1890 (duration: 07m 54s) [20:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:26] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:43:06] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:43:16] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:43:17] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:43:17] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [20:43:21] !log ladsgroup@tin Finished deploy [ores/deploy@263255a]: (no justification provided) (duration: 29m 07s) [20:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:57] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [20:44:06] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [20:44:16] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [20:45:46] another deploy is left [20:46:44] !log Updated Parsoid to ebac1890 (T165139) [20:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:54] T165139: Extension output is wrapped in
breaking editing in VE and rendering elsewhere - https://phabricator.wikimedia.org/T165139 [20:49:30] arlolra: tell me when you're done. Thanks [20:49:41] Amir1: all done [20:49:48] thanks [20:50:45] !log ladsgroup@tin Started deploy [ores/deploy@4874809]: Second deploy of ores for enabling frwiki damaging [20:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:55] (CR) Dzahn: "pretty sure you can't do "require => ''", that will break and not mean "nothing required". i asked" [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox) [20:51:38] (CR) Dzahn: ""undef" could work." [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox) [20:52:11] (PS8) Paladox: HHVM: Fix puppet on trusty [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) [20:52:16] (PS9) Paladox: HHVM: Fix puppet on trusty [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) [20:55:52] canary died, rolling back [20:56:09] !log ladsgroup@tin Finished deploy [ores/deploy@4874809]: Second deploy of ores for enabling frwiki damaging (duration: 05m 23s) [20:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:38] (CR) Dereckson: "Initially scheduled this Monday 18:00 UTC, but not deployed. Please reschedule it in another SWAT window." [mediawiki-config] - https://gerrit.wikimedia.org/r/353173 (owner: Aaron Schulz) [20:57:41] (CR) Dereckson: "Initially scheduled this Monday 18:00 UTC, but not deployed. Please reschedule it in another SWAT window." [mediawiki-config] - https://gerrit.wikimedia.org/r/331808 (owner: Aaron Schulz) [21:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170522T2100).
Please do the needful. [21:03:36] RECOVERY - Disk space on elastic1023 is OK: DISK OK [21:08:16] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:08:16] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:08:17] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:08:17] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:08:17] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:08:27] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:10:01] !log BBR: cp1074: reverted back to cubic+pfifo_fast - T147569 [21:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:10] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569 [21:10:56] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:11:01] !log BBR: cp1065: reverted back to cubic+pfifo_fast - T147569 [21:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:16] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:12:06] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [21:12:06] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [21:12:16] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [21:12:16] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [21:12:16] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [21:12:17] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [21:12:26] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [21:14:06] is citoid behavior related to the recent deployment of ores?
[21:19:17] Amir1: it's been doing that all weekend :/ [21:35:07] PROBLEM - Nginx local proxy to apache on mw1218 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.151 second response time [21:35:07] PROBLEM - HHVM rendering on mw1218 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [21:36:06] RECOVERY - Nginx local proxy to apache on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.186 second response time [21:36:06] RECOVERY - HHVM rendering on mw1218 is OK: HTTP OK: HTTP/1.1 200 OK - 79458 bytes in 0.316 second response time [21:39:56] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [22:19:16] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [22:19:16] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [22:19:26] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [22:19:26] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [22:19:26] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [22:19:26] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [22:19:36] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [22:21:26] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [22:21:49] (03CR) 10Nemo bis: [C: 031] "Per my comment on the task" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/354549 (https://phabricator.wikimedia.org/T121995) (owner: 10Dereckson) [22:22:06] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [22:22:16] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [22:22:16] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [22:22:17] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [22:22:17] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [22:22:17] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170522T2300). Please do the needful. [23:00:05] Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:06:47] * AaronSchulz goes [23:06:57] (03PS7) 10Aaron Schulz: Include DB shard in production SPI log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331808 [23:07:04] (03CR) 10Aaron Schulz: [C: 032] Include DB shard in production SPI log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331808 (owner: 10Aaron Schulz) [23:11:26] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:11:27] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:11:36] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:11:36] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:11:36] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:13:22] (03Merged) 10jenkins-bot: Include DB shard in production SPI log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331808 (owner: 10Aaron Schulz) [23:13:37] (03CR) 10jenkins-bot: Include DB shard in production SPI log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331808 (owner: 10Aaron Schulz) [23:14:26] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [23:14:26] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [23:14:26] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [23:14:27] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [23:14:27] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [23:15:33] !log aaron@tin Synchronized wmf-config/logging.php: Include DB shard in production SPI log entries 
(duration: 00m 38s) [23:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:02] (CR) Aaron Schulz: [C: 2] Move swift auth URL to ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/353173 (owner: Aaron Schulz) [23:18:35] (Merged) jenkins-bot: Move swift auth URL to ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/353173 (owner: Aaron Schulz) [23:18:44] (CR) jenkins-bot: Move swift auth URL to ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/353173 (owner: Aaron Schulz) [23:19:58] !log aaron@tin Synchronized wmf-config/ProductionServices.php: Move swift auth URL to ProductionServices (duration: 00m 38s) [23:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:52] !log aaron@tin Synchronized wmf-config/filebackend.php: Move swift auth URL to ProductionServices (duration: 00m 38s) [23:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:26] (CR) Paladox: "This breaks beta scap https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/17268/console" [mediawiki-config] - https://gerrit.wikimedia.org/r/353173 (owner: Aaron Schulz) [23:27:25] (PS1) Chad: Revert "Move swift auth URL to ProductionServices" [mediawiki-config] - https://gerrit.wikimedia.org/r/355170 [23:27:32] (CR) Chad: [C: 2] Revert "Move swift auth URL to ProductionServices" [mediawiki-config] - https://gerrit.wikimedia.org/r/355170 (owner: Chad) [23:30:30] (Merged) jenkins-bot: Revert "Move swift auth URL to ProductionServices" [mediawiki-config] - https://gerrit.wikimedia.org/r/355170 (owner: Chad) [23:30:39] (CR) jenkins-bot: Revert "Move swift auth URL to ProductionServices" [mediawiki-config] - https://gerrit.wikimedia.org/r/355170 (owner: Chad) [23:33:43] !log demon@tin Synchronized wmf-config/filebackend.php: I4b19b4a8f4f1ff7ad65fc02c0b89da651a883524 
(duration: 00m 38s) [23:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:30] !log demon@tin Synchronized wmf-config/ProductionServices.php: I4b19b4a8f4f1ff7ad65fc02c0b89da651a883524 (duration: 00m 38s) [23:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:50] (PS1) Aaron Schulz: Set mediaSwift* keys in LabsServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/355172 [23:35:11] AaronSchulz: I reverted you too [23:35:20] Didn't know if you were around. [23:39:45] (PS2) Aaron Schulz: Move swift auth URL to ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/355172 [23:40:33] RainbowSprinkles: t'was patching. I'll squashed into https://gerrit.wikimedia.org/r/#/c/355172/ now. [23:40:39] K [23:40:43] *all squashed, heh [23:41:13] (PS3) Dzahn: contint: role/profile conversion [puppet] - https://gerrit.wikimedia.org/r/355156 [23:41:28] In an ideal world, we'd standardize names and could just foreach the DCs ;-) [23:41:39] But beta makes up funny names [23:42:32] (CR) Chad: [C: 1] Move swift auth URL to ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/355172 (owner: Aaron Schulz) [23:43:07] RainbowSprinkles: is https://gerrit.wikimedia.org/r/#/c/354586/ going on? [23:43:25] otherwise, I'll do another go since prod looked fine [23:43:32] I wasn't doing swat [23:43:52] Dereckson didn't say anything, and can self-deploy if so desired [23:44:12] I assume it's a regular sync-file deal, I could just do that I suppose [23:44:51] Yeah, sync-file. 
I'd do the Services files first, then filebackend [23:44:54] (CR) Aaron Schulz: [C: 2] Fix hy.wikipedia high resolution logos [mediawiki-config] - https://gerrit.wikimedia.org/r/354586 (https://phabricator.wikimedia.org/T165811) (owner: Dereckson) [23:44:56] (reverse of what I did) [23:45:04] Ah, for yours I meant [23:46:19] (Merged) jenkins-bot: Fix hy.wikipedia high resolution logos [mediawiki-config] - https://gerrit.wikimedia.org/r/354586 (https://phabricator.wikimedia.org/T165811) (owner: Dereckson) [23:46:29] (CR) jenkins-bot: Fix hy.wikipedia high resolution logos [mediawiki-config] - https://gerrit.wikimedia.org/r/354586 (https://phabricator.wikimedia.org/T165811) (owner: Dereckson) [23:47:30] (CR) Aaron Schulz: [C: 2] Move swift auth URL to ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/355172 (owner: Aaron Schulz) [23:48:32] (Merged) jenkins-bot: Move swift auth URL to ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/355172 (owner: Aaron Schulz) [23:48:42] (CR) jenkins-bot: Move swift auth URL to ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/355172 (owner: Aaron Schulz) [23:48:54] !log aaron@tin Synchronized static/images/project-logos/hywiki-1.5x.png: Fix hy.wikipedia high resolution logos (duration: 00m 38s) [23:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:43] !log aaron@tin Synchronized static/images/project-logos/hywiki-2x.png: Fix hy.wikipedia high resolution logos (duration: 00m 38s) [23:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:56] (CR) Chad: [C: 1] Add techconduct.wikimedia.org for new private wiki [dns] - https://gerrit.wikimedia.org/r/354954 (https://phabricator.wikimedia.org/T165977) (owner: Dereckson) [23:50:21] (CR) Chad: [C: 1] Set initial configuration for techconduct.wikimedia.org [mediawiki-config] - 
https://gerrit.wikimedia.org/r/354985 (https://phabricator.wikimedia.org/T165977) (owner: Dereckson) [23:51:22] !log aaron@tin Synchronized wmf-config: Move swift auth URL to ProductionServices (duration: 00m 52s) [23:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:43] (CR) Dzahn: [C: 2] Add techconduct.wikimedia.org for new private wiki [dns] - https://gerrit.wikimedia.org/r/354954 (https://phabricator.wikimedia.org/T165977) (owner: Dereckson) [23:52:24] AaronSchulz: Heh, you could possibly race on apaches where filebackend lands before *Services.php [23:52:32] (hence why I suggested ordered sync-file) [23:52:55] But self-fixes...soon [23:53:11] in theory, the first time I did PS, then fb. [23:53:33] It would be three syncs now though. [23:53:48] Could do sync-file then sync-dir [23:53:50] But yeah [23:53:51] afaik we still do the /tmp rename step, so it's a dot of a window [23:53:53] It's all kind of ugly [23:54:03] (on the resync level) [23:56:22] (PS3) Dzahn: graphite: move 'standard' and 'base::firewall' to role [puppet] - https://gerrit.wikimedia.org/r/353364
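Editor's note on the sync-ordering exchange above (filebackend.php possibly landing on a server before the *Services.php files it reads): a minimal sketch of the "Services files first, then filebackend" ordering. It assumes that any file whose name ends in Services.php provides settings and that every other file consumes them. The function name order_for_sync and the suffix rule are hypothetical and are not part of scap or of the deployers' actual procedure.

```python
from typing import Iterable, List

# Assumption for illustration only: files ending in this suffix define values
# that other config files read, so they should be synced first.
PROVIDER_SUFFIXES = ("Services.php",)


def order_for_sync(paths: Iterable[str]) -> List[str]:
    """Return paths with provider files first, keeping their relative order."""
    paths = list(paths)
    providers = [p for p in paths if p.endswith(PROVIDER_SUFFIXES)]
    consumers = [p for p in paths if not p.endswith(PROVIDER_SUFFIXES)]
    return providers + consumers


if __name__ == "__main__":
    changed = [
        "wmf-config/filebackend.php",
        "wmf-config/ProductionServices.php",
        "wmf-config/LabsServices.php",
    ]
    for path in order_for_sync(changed):
        # Each path would be handed to its own sync step, in this order.
        print(path)
```

Syncing one file per step in this order means a server never holds a consumer file that refers to keys its Services files do not yet define. A single directory-wide sync does not by itself guarantee that order, which is the race raised in the conversation.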