[00:12:21] (PS1) Mattflaschen: Add flow-create-board for gomwiki sysop [mediawiki-config] - https://gerrit.wikimedia.org/r/298405 (https://phabricator.wikimedia.org/T139226)
[00:18:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[00:19:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:13] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[00:24:48] (CR) EBernhardson: [C: 1] logstash: Remove all _* fields from gelf records [puppet] - https://gerrit.wikimedia.org/r/298382 (owner: BryanDavis)
[00:25:15] (CR) EBernhardson: [C: 1] logstash: Remove normalize_fields fitler [puppet] - https://gerrit.wikimedia.org/r/298381 (owner: BryanDavis)
[00:30:03] PROBLEM - Disk space on elastic2007 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 107909 MB (15% inode=99%)
[00:30:16] Operations, Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2450973 (Dzahn) Open>stalled
[00:32:39] (PS2) BBlack: Zero VCL: remove ZeroTLS header/cookie [puppet] - https://gerrit.wikimedia.org/r/294052
[00:34:31] (CR) BBlack: [C: 2] Zero VCL: remove ZeroTLS header/cookie [puppet] - https://gerrit.wikimedia.org/r/294052 (owner: BBlack)
[00:37:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[00:37:28] PROBLEM - Disk space on elastic2009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 107442 MB (15% inode=99%)
[00:41:21] (CR) BBlack: [C: 1] "LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/296634 (https://phabricator.wikimedia.org/T117618) (owner: Brian Wolff)
[00:45:01] (PS2) BBlack: Modify secret.rb to accept a file list and use first match, like http://www.puppetcookbook.com/posts/select-a-file-based-on-a-fact.html [puppet] - https://gerrit.wikimedia.org/r/294331 (owner: Jgreen)
[00:50:39] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: puppet fail
[00:51:37] PROBLEM - Disk space on elastic2007 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 108452 MB (15% inode=99%)
[00:51:59] PROBLEM - Host labvirt1012 is DOWN: PING CRITICAL - Packet loss = 100%
[00:58:17] PROBLEM - Disk space on elastic2007 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109289 MB (15% inode=99%)
[00:58:58] RECOVERY - Host labvirt1012 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[01:09:56] Operations, ops-eqiad, Patch-For-Review: rack/setup/install/deploy labvirt nodes - https://phabricator.wikimedia.org/T138509#2451070 (Andrew) That video rendering instance isn't pooled yet, so I just rebooted 1012 and turned on HT. Does it look right now?
[01:10:57] PROBLEM - Disk space on elastic2009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109463 MB (15% inode=99%)
[01:17:28] PROBLEM - Disk space on elastic2009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109248 MB (15% inode=99%)
[01:19:38] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:29:17] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80082 MB (15% inode=99%)
[01:43:25] (PS1) Dzahn: wikistats: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298409
[01:49:49] (PS1) Dzahn: spare: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298410
[01:50:23] (PS2) Dzahn: wikistats: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298409
[01:52:33] (PS2) Dzahn: spare: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298410
[02:03:47] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:05:57] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:21:10] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.9) (duration: 08m 44s)
[02:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:51] (PS1) Dzahn: iegreview: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298411
[02:26:41] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jul 12 02:26:41 UTC 2016 (duration 5m 31s)
[02:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:40] (PS1) Dzahn: ipv6relay: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298412
[02:36:30] (CR) Dzahn: [C: -1] ipv6relay: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298412 (owner: Dzahn)
[02:40:37] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80474 MB (15% inode=99%)
[02:47:08] RECOVERY - Disk space on elastic1020 is OK: DISK OK
[02:50:28] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[02:50:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[03:04:06] RECOVERY - Disk space on elastic1018 is OK: DISK OK
[03:38:49] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:38:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:41:08] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[03:41:08] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[04:02:19] PROBLEM - Disk space on elastic2008 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109234 MB (15% inode=99%)
[04:03:17] (CR) BryanDavis: [C: 1] iegreview: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298411 (owner: Dzahn)
[04:15:29] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 13390 MB (3% inode=99%)
[04:28:37] (PS1) Ori.livneh: All wikis back to 1.28.0-wmf8 [mediawiki-config] - https://gerrit.wikimedia.org/r/298414
[04:29:14] (CR) Ori.livneh: [C: 2] "Coordinated with Greg" [mediawiki-config] - https://gerrit.wikimedia.org/r/298414 (owner: Ori.livneh)
[04:30:21] (Merged) jenkins-bot: All wikis back to 1.28.0-wmf8 [mediawiki-config] - https://gerrit.wikimedia.org/r/298414 (owner: Ori.livneh)
[04:34:53] !log ori@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message)
[04:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:35:11] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:37:01] RECOVERY - Disk space on elastic2008 is OK: DISK OK
[04:37:41] !log Reverted all wikis to wmf8 due to tenfold increase in T119736
[04:37:42] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736
[04:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:39:22] PROBLEM - MegaRAID on ms-be1012 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[04:41:02] PROBLEM - Disk space on ms-be1012 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error
[04:41:21] PROBLEM - Disk space on elastic2007 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109540 MB (15% inode=99%)
[04:41:41] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[04:44:42] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:55:32] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[05:05:35] RECOVERY - Disk space on ms-be1012 is OK: DISK OK
[05:46:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:50:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:50:54] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 2 failures
[05:56:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:59:59] !log running checkLocalUsers.php on terbium
[06:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:02:46] PROBLEM - Disk space on elastic2007 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109375 MB (15% inode=99%)
[06:08:36] PROBLEM - Disk space on elastic2007 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 107004 MB (15% inode=99%)
[06:11:18] (PS1) Legoktm: Don't block logins if CentralAuthUser::queryAttached() fails [mediawiki-config] - https://gerrit.wikimedia.org/r/298416 (https://phabricator.wikimedia.org/T119736)
[06:15:16] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:24:29] legoktm :-) *thumbs up*
[06:26:59] (CR) Ori.livneh: [C: 1] Don't block logins if CentralAuthUser::queryAttached() fails [mediawiki-config] - https://gerrit.wikimedia.org/r/298416 (https://phabricator.wikimedia.org/T119736) (owner: Legoktm)
[06:27:21] (CR) Legoktm: [C: 2] Don't block logins if CentralAuthUser::queryAttached() fails [mediawiki-config] - https://gerrit.wikimedia.org/r/298416 (https://phabricator.wikimedia.org/T119736) (owner: Legoktm)
[06:28:00] (Merged) jenkins-bot: Don't block logins if CentralAuthUser::queryAttached() fails [mediawiki-config] - https://gerrit.wikimedia.org/r/298416 (https://phabricator.wikimedia.org/T119736) (owner: Legoktm)
[06:29:05] syncing to mw1017 first...
[06:31:28] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] !log legoktm@tin Synchronized wmf-config/CommonSettings.php: Don't block logins if CentralAuthUser::queryAttached() fails - T119736 (duration: 00m 27s)
[06:31:58] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736
[06:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:32:27] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[06:32:36] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:56] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:07] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:34:36] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:37] RECOVERY - Disk space on lithium is OK: DISK OK
[06:55:57] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:56:16] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:58:07] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:27] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:47] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:40] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[07:17:40] PROBLEM - Disk space on elastic2009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109311 MB (15% inode=99%)
[07:19:48] (PS4) Giuseppe Lavagetto: service_checker: use external package [puppet] - https://gerrit.wikimedia.org/r/297761
[07:21:51] (CR) Giuseppe Lavagetto: [C: 2] service_checker: use external package [puppet] - https://gerrit.wikimedia.org/r/297761 (owner: Giuseppe Lavagetto)
[07:28:49] PROBLEM - Disk space on elastic2009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109582 MB (15% inode=99%)
[07:30:59] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:34:51] <_joe_> gehel: elastic2009 is expected I guess?
[07:35:50] Operations, Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#2451595 (Aklapper) So.... anyone wants to make a decision (kill vs. decline)?
[07:42:29] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:46:58] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[07:49:02] !log removing api servers mw113[0-9] from service via conftool as first decom step (T139353)
[07:49:04] T139353: Decommission all old mediawiki appservers in eqiad - https://phabricator.wikimedia.org/T139353
[07:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:55:28] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:57:49] PROBLEM - Disk space on elastic2009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 107914 MB (15% inode=99%)
[08:05:05] RECOVERY - Disk space on elastic2007 is OK: DISK OK
[08:34:20] (PS1) Addshore: Update analytics/wmde/scripts repo [puppet] - https://gerrit.wikimedia.org/r/298424 (https://phabricator.wikimedia.org/T140064)
[08:35:16] (PS2) Addshore: Update analytics/wmde/scripts repo [puppet] - https://gerrit.wikimedia.org/r/298424 (https://phabricator.wikimedia.org/T140064)
[08:35:17] _joe_: elastic2009 IS probably the usual cluster imbalance.
[08:36:14] _joe_: just out of the hospital, I'll check in a few minutes, but it does not seem to be something too worrisome
[08:36:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:36:37] <_joe_> yeah don't worry :)
[08:37:44] <_joe_> checking ^^
[08:40:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:47:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:49:11] Operations, ops-eqiad, Patch-For-Review: rack/setup/install/deploy labvirt nodes - https://phabricator.wikimedia.org/T138509#2451849 (Southparkfan) Open>Resolved Yes, it does.
[09:08:47] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:14:59] Operations, Commons, media-storage, User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2451947 (TheDJ) Times, esp. italics of it look extremely jagged on that PNG to me. Terminal might 'seem' a bit better compar...
[09:17:16] RECOVERY - Disk space on elastic2009 is OK: DISK OK
[09:18:31] <_joe_> gehel: ^^ :)
[09:19:18] _joe_: yep, I was looking into it... and it just fixed itself (I must have scared those servers)
[09:19:35] (PS1) Gilles: Add ability to dual-serve a portion of Swift rewrite.py traffic to Thumbor [puppet] - https://gerrit.wikimedia.org/r/298431 (https://phabricator.wikimedia.org/T140072)
[09:20:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/2/1: down - Core: cr1-eqiad:xe-4/2/0 (Telia, IC-307235, 34ms) {#10693} [10Gbps wave]
[09:21:53] ^ scheduled maintenance from telia
[09:22:22] !log progressively delete esams swift containers, unused and not in production
[09:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:22:42] _joe_: response times of both eqiad and codfw elasticsearch cluster is not looking great. Having a look...
[09:22:47] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[09:23:07] (CR) jenkins-bot: [V: -1] Add ability to dual-serve a portion of Swift rewrite.py traffic to Thumbor [puppet] - https://gerrit.wikimedia.org/r/298431 (https://phabricator.wikimedia.org/T140072) (owner: Gilles)
[09:25:56] PROBLEM - very high load average likely xfs on ms-be3004 is CRITICAL: CRITICAL - load average: 130.93, 156.98, 100.96
[09:28:53] (CR) Filippo Giunchedi: [C: -1] "see comment about configuration vs autodetection" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/298297 (https://phabricator.wikimedia.org/T64835) (owner: Alex Monk)
[09:32:01] !log reboot ms-be3004 / high load average and xfs unhappy
[09:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:35:05] !log lowering elasticsearch codfw high watermark to rebalance cluster
[09:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:35:45] (CR) Hashar: "Looks like all use cases are already covered on Wikitech or can be trivially added to the couple instances that would need them." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/296809 (owner: Chad)
[09:37:37] RECOVERY - very high load average likely xfs on ms-be3004 is OK: OK - load average: 20.98, 5.91, 2.02
[09:44:00] (PS2) Gilles: Add ability to dual-serve a portion of Swift rewrite.py traffic to Thumbor [puppet] - https://gerrit.wikimedia.org/r/298431 (https://phabricator.wikimedia.org/T140072)
[09:48:31] (CR) Hashar: "Danke Daniel.Will wait a bit to check whether tidy actually garbage collect the files." [puppet] - https://gerrit.wikimedia.org/r/295641 (https://phabricator.wikimedia.org/T126552) (owner: Hashar)
[09:54:04] Operations, Wikimedia-Apache-configuration: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2452012 (fgiunchedi) p:Triage>Low triaging as low as I'm not sure of the actual impact (?) the things to do IMO would be: [] audi...
[10:04:15] (PS1) Mobrovac: WIP: Parsoid: Move to service::node [puppet] - https://gerrit.wikimedia.org/r/298436
[10:05:27] Operations, Mail: not being able to send emails via Special:EmailUser - https://phabricator.wikimedia.org/T137337#2452020 (fgiunchedi) p:Triage>Low there were indeed many exim queued messages around the time this issue was reported, are the delays still present?
[10:05:54] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review, and 2 others: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2452022 (Pokefan95) >>! In T137973#2447037, @biplabanand wrote: >>>! In T137973#2440559, @Pokefan95 wrote: >>>>! In T137973#244...
[10:08:52] (CR) jenkins-bot: [V: -1] WIP: Parsoid: Move to service::node [puppet] - https://gerrit.wikimedia.org/r/298436 (owner: Mobrovac)
[10:09:59] Operations, Discovery, Discovery-Search-Backlog, Dumps-Generation, Elasticsearch: Link "current" to last dump set on cyrrussearch get a 404 - https://phabricator.wikimedia.org/T138176#2452024 (fgiunchedi) p:Triage>Normal +dumps, @ArielGlenn perhaps?
[10:10:20] Operations, Discovery, Maps, Maps-data: Improve automation around Maps servers - https://phabricator.wikimedia.org/T138017#2452028 (fgiunchedi) p:Triage>Normal
[10:10:38] Operations, Discovery, Maps, Maps-data, Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2452030 (fgiunchedi) p:Triage>Normal
[10:11:42] (PS2) Mobrovac: WIP: Parsoid: Move to service::node [puppet] - https://gerrit.wikimedia.org/r/298436
[10:11:57] PROBLEM - puppet last run on mw2063 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:12:04] jynus: is 'hosts' in https://phabricator.wikimedia.org/T138810 appservers or dbs or sth else?
[10:12:52] everything that is on depends on s* dbs
[10:12:55] *or
[10:13:15] (CR) jenkins-bot: [V: -1] WIP: Parsoid: Move to service::node [puppet] - https://gerrit.wikimedia.org/r/298436 (owner: Mobrovac)
[10:13:32] that means either set things on read only or switchover to codfw
[10:13:33] Operations, Jupyter-Hub: notebook1001 shown as DOWN in icinga, due to firewall rules - https://phabricator.wikimedia.org/T138685#2452031 (fgiunchedi) p:Triage>Normal
[10:13:45] let me clarify it
[10:14:00] Operations, Discovery, Discovery-Search-Backlog, Dumps-Generation, Elasticsearch: Link "current" to last dump set on cirrussearch get a 404 - https://phabricator.wikimedia.org/T138176#2452032 (fgiunchedi)
[10:14:33] jynus: thanks!
[10:14:41] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review, and 2 others: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2452033 (Steinsplitter) >@Pokefan95 wrote: > If not, then try using password reset for all sites that failed. This won't work,...
[10:14:58] (PS3) Mobrovac: WIP: Parsoid: Move to service::node [puppet] - https://gerrit.wikimedia.org/r/298436
[10:15:41] Operations, Mobile-Content-Service, Services: mobileapps 500s following reboot of restbase1007 - https://phabricator.wikimedia.org/T138314#2452034 (fgiunchedi) p:Triage>Normal
[10:17:09] Operations, Availability: Set databases as read-only or switchover to secondary datacenter - https://phabricator.wikimedia.org/T138810#2452051 (jcrespo)
[10:17:29] godog, it is not a real "task"
[10:17:43] but I mentioned several times my need for a "TODO"
[10:18:06] it is a way to mark things that are pending that
[10:18:56] Operations, RESTBase, Services, Wikimedia-Site-requests: Index page https://wikimedia.org/api/ is broken / RESTBase not discoverable - https://phabricator.wikimedia.org/T138848#2452054 (fgiunchedi) p:Triage>Normal indeed, all of these seem like they would be fixed once restbase has suppor...
[10:19:57] jynus: I see, yeah it popped up because it is in "needs triage" priority heh, I'll leave it up to you to triage
[10:20:51] Operations, Availability: Set databases as read-only or switchover to secondary datacenter - https://phabricator.wikimedia.org/T138810#2452056 (jcrespo) p:Triage>Normal
[10:24:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[10:25:37] (PS1) Giuseppe Lavagetto: service: remove service_checker tests [puppet] - https://gerrit.wikimedia.org/r/298438
[10:26:57] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:28:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[10:29:01] (CR) Giuseppe Lavagetto: [C: 2] service: remove service_checker tests [puppet] - https://gerrit.wikimedia.org/r/298438 (owner: Giuseppe Lavagetto)
[10:29:16] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[10:29:36] Operations, Traffic: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2452066 (ema) varnish-module is now available for testing on apt.wikimedia.org (jessie-wikimedia/backports).
[10:33:58] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:37:17] RECOVERY - puppet last run on mw2063 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[10:45:49] !log terbium:~# lvextend --size +70G -r /dev/mapper/terbium--vg-root T139786
[10:45:50] T139786: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786
[10:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:47:17] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:50:17] Operations: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786#2442645 (Joe) I already solved the problem - nutcracker was running at verbosity 5. The only remaining problem is maybe getting rid of the old logfiles.
[10:54:26] _joe_: any idea on how the increased verbosity happened? the disk filled up in a day more or less
[10:55:23] <_joe_> godog: well there was a change to a maint script that used redis I guess
[10:55:38] <_joe_> so that nutcracker was instantly logging a lot
[10:55:39] Operations, Cassandra, Services, hardware-requests: 6x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2452093 (fgiunchedi) p:Triage>Normal
[10:55:48] <_joe_> but the config was like that since forever I think
[10:59:32] indeed, verbosity=5 is the default
[11:03:36] (PS1) Urbanecm: Logo update for trwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/298441 (https://phabricator.wikimedia.org/T140015)
[11:05:04] Operations: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786#2442645 (fgiunchedi) ok so now verbosity is 4, agreed there is no need to tune the log retention policy for nutcracker as the title suggests?
[11:05:59] (PS2) Urbanecm: Logo update for trwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/298441 (https://phabricator.wikimedia.org/T140015)
[11:09:03] (PS2) Hashar: contint: migrate coverage report under doc.wm.o [puppet] - https://gerrit.wikimedia.org/r/298274 (https://phabricator.wikimedia.org/T139620)
[11:09:08] (PS3) Hashar: contint: migrate coverage report under doc.wm.o [puppet] - https://gerrit.wikimedia.org/r/298274 (https://phabricator.wikimedia.org/T139620)
[11:09:48] (CR) Hashar: "I have dropped the Depends-On. I have confirmed on my local machine the Redirect 301 acts appropriately." [puppet] - https://gerrit.wikimedia.org/r/298274 (https://phabricator.wikimedia.org/T139620) (owner: Hashar)
[11:14:39] (PS1) Urbanecm: HD logos for trwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/298444 (https://phabricator.wikimedia.org/T140015)
[11:14:49] mhhh there's a whole bunch of UNKNOWNs in icinga for some LVS services, reported as '(null)', _joe_ perhaps related to your change?
[11:19:24] Operations: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786#2452145 (hashar) @joe fixed the verbosity with: 07d8690 - //nutcracker: lower verbosity on the maintenance hosts// Apparently we have overriden the value everywhere: ``` hieradata/r...
[11:22:36] Operations, Mail: not being able to send emails via Special:EmailUser - https://phabricator.wikimedia.org/T137337#2452157 (Mardetanha) >>! In T137337#2452020, @fgiunchedi wrote: > there were indeed many exim queued messages around the time this issue was reported, are the delays still present? yes
[11:24:49] godog: does the clinic physician has some spare time to land an Apache Redirect for the ci/doc.wm.o sites please? :)
[11:25:00] straightforward and I tested it locally https://gerrit.wikimedia.org/r/#/c/298274/3/modules/contint/templates/apache/integration.wikimedia.org.erb,cm
[11:25:32] aim is to move https://integration.wikimedia.org/cover/ to https://doc.wikimedia.org/cover/ so we can later move doc.wm.o off of gallium (Precise host)
[11:27:53] Operations, media-storage: investigate swift used space spikes since June 2016 - https://phabricator.wikimedia.org/T140075#2452167 (fgiunchedi)
[11:28:48] hashar: mhh the doc.wm.org/cover url gives 404 ?
[11:28:57] yeah I havent migrated it yet :)
[11:29:14] will update the Jenkins jobs to publish to the new docroot
[11:29:19] and manually move the files on gallium
[11:29:30] well actually I can do it right now :)
[11:30:08] hashar: ok but I'm about to go to lunch, can take a look when I'm back
[11:35:39] godog: ok :)
[11:37:01] (PS1) Yuvipanda: Factor out deletion of objects & waiting for pods [software/tools-webservice] - https://gerrit.wikimedia.org/r/298446
[11:37:38] (CR) jenkins-bot: [V: -1] Factor out deletion of objects & waiting for pods [software/tools-webservice] - https://gerrit.wikimedia.org/r/298446 (owner: Yuvipanda)
[11:38:41] (PS2) Yuvipanda: Factor out deletion of objects & waiting for pods [software/tools-webservice] - https://gerrit.wikimedia.org/r/298446
[11:42:22] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:48:04] godog: dont bother. I have eventually found out we have a .htaccess to handle the redirect :) thx anyway!
[11:48:44] (Abandoned) Hashar: contint: migrate coverage report under doc.wm.o [puppet] - https://gerrit.wikimedia.org/r/298274 (https://phabricator.wikimedia.org/T139620) (owner: Hashar)
[11:49:12] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: puppet fail
[11:50:03] (PS1) Yuvipanda: Bypass querycache for checking status [software/tools-webservice] - https://gerrit.wikimedia.org/r/298451
[11:51:35] (PS2) Yuvipanda: Bypass querycache for checking status [software/tools-webservice] - https://gerrit.wikimedia.org/r/298451
[11:58:51] Operations, Performance-Team, Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2452288 (elukey) >>! In T139606#2437566, @fgiunchedi wrote: > yeah the new hardware would work too and easier to compare, I think we can grab 2x machines from appservers /cc @Jo...
[12:01:15] (PS1) Yuvipanda: Fix stupid logic errors in starting/stopping [software/tools-webservice] - https://gerrit.wikimedia.org/r/298454
[12:06:35] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[12:09:17] (PS1) Yuvipanda: Take status of pod into account as well for webservice status [software/tools-webservice] - https://gerrit.wikimedia.org/r/298455
[12:10:19] Puppet, Labs, Labs-project-Phabricator, Patch-For-Review: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#2452335 (Danny_B)
[12:12:10] (PS1) Elukey: Change AQS cassandra cluster name to remove "Test" (aqs100[456]) [puppet] - https://gerrit.wikimedia.org/r/298456
[12:14:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:14:33] (PS2) Yuvipanda: Fix stupid logic errors in starting/stopping [software/tools-webservice] - https://gerrit.wikimedia.org/r/298454
[12:14:35] (PS2) Yuvipanda: Take status of pod into account as well for webservice status [software/tools-webservice] - https://gerrit.wikimedia.org/r/298455
[12:14:55] RECOVERY - puppet last run on mw2240 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:22:29] (PS1) Yuvipanda: Refactor to make spawning shell/webservice similar [software/tools-webservice] - https://gerrit.wikimedia.org/r/298459
[12:23:09] (CR) jenkins-bot: [V: -1] Refactor to make spawning shell/webservice similar [software/tools-webservice] - https://gerrit.wikimedia.org/r/298459 (owner: Yuvipanda)
[12:26:16] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[12:26:19] (PS2) Yuvipanda: Refactor to make spawning shell/webservice similar [software/tools-webservice] - https://gerrit.wikimedia.org/r/298459
[12:26:30] (Abandoned) Yuvipanda: Bypass querycache for checking status [software/tools-webservice] - https://gerrit.wikimedia.org/r/298451 (owner: Yuvipanda)
[12:27:49] (CR) Yuvipanda: [C: -2] "Yup, let's kill /nodes instead!" [puppet] - https://gerrit.wikimedia.org/r/296809 (owner: Chad)
[12:28:26] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5164751 keys - replication_delay is 0
[12:36:21] (PS3) Addshore: Update analytics/wmde/scripts repo [puppet] - https://gerrit.wikimedia.org/r/298424 (https://phabricator.wikimedia.org/T140064)
[12:38:58] (PS1) Yuvipanda: Add nodejs webservice type [software/tools-webservice] - https://gerrit.wikimedia.org/r/298464
[12:46:48] (PS4) Addshore: Update analytics/wmde/scripts repo [puppet] - https://gerrit.wikimedia.org/r/298424 (https://phabricator.wikimedia.org/T140064)
[12:47:26] (PS1) BBlack: VCL: raise 4xx TTL cap from 1m to 10m [puppet] - https://gerrit.wikimedia.org/r/298467
[12:50:05] Operations, Traffic, Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2452441 (faidon) Perhaps we should wait until Varnish 5.0, supposedly out in a couple of months and with HTTP/2.0 support, before we proceed with this change?
[12:51:41] Operations, Traffic, Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2452455 (faidon) I don't particularly object into either moving port 80 to `sh` or to nginx, but I don't think that TFO on port 80 will make any kind of performance impact — at le...
[12:53:20] Operations, Traffic, Wikimedia-Blog, HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2452456 (faidon) We (@Tbayer mostly) have asked for this repeatedly, to no avail. There was a thread with comms that hasn't seen any activity lately, I'll ping again…
[12:53:21] (CR) BBlack: [C: -1] "Looks good overall, just the one nit noted in comments about a missing brace, I think." (1 comment) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/295652 (owner: Elukey)
[12:56:02] yuvipanda: about dropping /nodes/ from puppet.git, any help is welcome :)
[12:56:37] yuvipanda: I have no idea how on labs to have a Hiera regex rule against the FQDN
[12:56:48] Operations, Traffic, Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2452463 (BBlack) @faidon - so far we've seen TFO stats showing more TFO failures than successes, so we're looking for reasons why TFO so commonly fails when attempted. That port...
[13:04:22] (PS4) Mobrovac: WIP: Parsoid: Move to service::node [puppet] - https://gerrit.wikimedia.org/r/298436
[13:05:42] !log restarting nodemanagers on analytics 1039 1046 and 1054
[13:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:07:30] Operations, Traffic, Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2452527 (BBlack) It will be another 6 months before we're even settled into a full Varnish 4 world. We have several major followup projects pending on that (e.g. xkey, and f...
[13:10:38] (PS5) Ottomata: Update analytics/wmde/scripts repo [puppet] - https://gerrit.wikimedia.org/r/298424 (https://phabricator.wikimedia.org/T140064) (owner: Addshore)
[13:10:59] (CR) Ottomata: [C: 2 V: 2] Update analytics/wmde/scripts repo [puppet] - https://gerrit.wikimedia.org/r/298424 (https://phabricator.wikimedia.org/T140064) (owner: Addshore)
[13:14:05] (PS10) Elukey: Add the -T VSL API timeout parameter plus the related formatter.
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/295652 [13:14:58] (03CR) 10BBlack: [C: 031] Add the -T VSL API timeout parameter plus the related formatter. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/295652 (owner: 10Elukey) [13:16:00] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2452575 (10Joe) [13:16:30] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2452577 (10Joe) [13:18:00] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:00] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2452594 (10faidon) The only requirement from my side would be to use a granularity limiter ("g=donate" or "g=donate*" perhaps?), which I'm guess... [13:18:31] (03CR) 10Elukey: [C: 032 V: 032] Add the -T VSL API timeout parameter plus the related formatter. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/295652 (owner: 10Elukey) [13:20:10] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:34:55] (03PS1) 10Yuvipanda: Permit doing webservice shell for k8s with a running ge job [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298473 [13:36:12] (03PS2) 10Yuvipanda: Permit doing webservice shell for k8s with a running ge job [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298473 [13:38:49] PROBLEM - Disk space on elastic2006 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 108659 MB (15% inode=99%) [13:41:06] 06Operations, 06Project-Admins, 05WMF-NDA: Project proposal: WMF-NDA - https://phabricator.wikimedia.org/T1051#2452777 (10Danny_B) [13:41:59] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2452782 (10Jgreen) Thanks @dpatrick & @faidon. @CCogdill_WMF so the next step--could you get a statement from Silverpop to the effect of what dp... [13:49:25] (03PS1) 10Addshore: statistics::wmde ensure 'production' branch latest [puppet] - 10https://gerrit.wikimedia.org/r/298474 [13:49:35] !log cache nodes: apt-get upgrade to latest (just 3.16 kernel) [13:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:52] (03CR) 10Ottomata: [C: 032] statistics::wmde ensure 'production' branch latest [puppet] - 10https://gerrit.wikimedia.org/r/298474 (owner: 10Addshore) [13:51:14] well ottomata that is probably nicer than bumping the hash over and over again too! 
[13:52:00] 06Operations, 06Collaboration-Team-Interested, 06Developer-Relations, 06Editing-Department, and 13 others: Create team projects for all teams participating in scrum of scrums - https://phabricator.wikimedia.org/T1211#2452816 (10Danny_B) [13:52:02] !log lvs nodes: apt-get upgrade to latest (various base system packages) [13:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:54:24] !next [13:54:43] (03PS6) 10Reedy: Swap to using extension.json where it exists in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298054 (https://phabricator.wikimedia.org/T139800) [13:55:01] (03CR) 10Reedy: [C: 032] "To test on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298054 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [13:55:45] (03Merged) 10jenkins-bot: Swap to using extension.json where it exists in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298054 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [13:58:49] 06Operations, 06Project-Admins: Create projects for Ops goals - https://phabricator.wikimedia.org/T87262#2452905 (10Danny_B) [14:00:59] RECOVERY - Disk space on elastic2006 is OK: DISK OK [14:03:19] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298476 (https://phabricator.wikimedia.org/T128546) [14:04:18] !log lvs nodes: apt-get install linux-meta [14:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:29] (03CR) 10Faidon Liambotis: "2/3 :)" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [14:04:35] 06Operations, 06Project-Admins: create procurement project - https://phabricator.wikimedia.org/T93796#2452947 (10Danny_B) [14:05:21] hashar: ok! 
[14:05:55] (03PS1) 10Addshore: Fix final few paths in statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/298477 [14:07:27] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2452957 (10hashar) doc.wikimedia.org home is tracked via T137890 [14:09:11] 06Operations: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786#2452978 (10fgiunchedi) yeah I agree if the default verbosity of 5 (also nutcracker's) leads to useless/too verbose logging it should be lowered to 4 even in the module [14:10:21] (03CR) 10Ema: [C: 031] VCL: raise 4xx TTL cap from 1m to 10m [puppet] - 10https://gerrit.wikimedia.org/r/298467 (owner: 10BBlack) [14:11:43] 06Operations, 10netops, 10Continuous-Integration-Infrastructure (phase-out-gallium): Relocate CI generated docs and coverage reports - https://phabricator.wikimedia.org/T137890#2452985 (10hashar) I have updated the task with a basic overview. The doc is generated on labs instances and rsync ed to the labs... [14:11:57] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: un-hieraize cgroup_enable boot-settings [puppet] - 10https://gerrit.wikimedia.org/r/296732 (owner: 10Faidon Liambotis) [14:13:33] (03PS2) 10Giuseppe Lavagetto: puppet: add a function for performing conftool lookups [puppet] - 10https://gerrit.wikimedia.org/r/283151 [14:14:38] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [14:16:23] _joe_: neon is reporting failures to execute '/usr/bin/check-service-swagger', no such file or directory [14:16:30] (03CR) 10jenkins-bot: [V: 04-1] puppet: add a function for performing conftool lookups [puppet] - 10https://gerrit.wikimedia.org/r/283151 (owner: 10Giuseppe Lavagetto) [14:16:42] <_joe_> godog: wat? 
[14:17:05] yeah, also the service-checker related services are reported as UNKNOWN [14:17:33] !log reedy@tin Synchronized wmf-config/extension-list: moar extension.json (duration: 00m 26s) [14:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:39] !log nginx 1.11.2-1+wmf1 uploaded to carbon [14:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:01] <_joe_> godog, meh, it's /usr/bin/service-checker-swagger [14:18:03] <_joe_> fixing [14:19:05] heh, almost [14:19:08] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [14:19:52] (03PS1) 10Giuseppe Lavagetto: icinga: fix command name [puppet] - 10https://gerrit.wikimedia.org/r/298478 [14:20:09] (03PS2) 10BBlack: VCL: raise 4xx TTL cap from 1m to 10m [puppet] - 10https://gerrit.wikimedia.org/r/298467 [14:20:17] (03CR) 10BBlack: [C: 032] VCL: raise 4xx TTL cap from 1m to 10m [puppet] - 10https://gerrit.wikimedia.org/r/298467 (owner: 10BBlack) [14:20:43] (03PS2) 10Reedy: Use extension.json in extension-list-wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298238 (https://phabricator.wikimedia.org/T139800) [14:20:53] 06Operations, 10ops-eqiad: ms-be1012.eqiad.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T140101#2453018 (10fgiunchedi) 03NEW [14:20:53] (03CR) 10Reedy: [C: 032] Use extension.json in extension-list-wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298238 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [14:21:23] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga: fix command name [puppet] - 10https://gerrit.wikimedia.org/r/298478 (owner: 10Giuseppe Lavagetto) [14:21:31] (03Merged) 10jenkins-bot: Use extension.json in extension-list-wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298238 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [14:21:43] (03PS3) 10BBlack: VCL: raise 4xx TTL cap from 1m to 
10m [puppet] - 10https://gerrit.wikimedia.org/r/298467 [14:21:51] (03CR) 10BBlack: [V: 032] VCL: raise 4xx TTL cap from 1m to 10m [puppet] - 10https://gerrit.wikimedia.org/r/298467 (owner: 10BBlack) [14:23:06] ACKNOWLEDGEMENT - MegaRAID on ms-be1012 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi sdh broken, T140101 [14:23:06] ACKNOWLEDGEMENT - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdh broken, T140101 [14:23:09] (03PS2) 10Addshore: Fix final few paths in statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/298477 [14:23:15] !log reedy@tin Synchronized wmf-config/extension-list: even more extension.json (duration: 00m 26s) [14:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:37] (03PS2) 10Elukey: Change AQS cassandra cluster name to remove "Test" (aqs100[456]) [puppet] - 10https://gerrit.wikimedia.org/r/298456 [14:25:07] (03CR) 10Ottomata: [C: 032] Fix final few paths in statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/298477 (owner: 10Addshore) [14:31:28] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Dumps-Generation, 10Elasticsearch: Link "current" to last dump set on cirrussearch get a 404 - https://phabricator.wikimedia.org/T138176#2453078 (10ArielGlenn) Yep this is mine. There's something wrong with the script still, I'll have a look. 
[14:31:46] (03PS3) 10Elukey: Change AQS cassandra cluster name to remove "Test" (aqs100[456]) [puppet] - 10https://gerrit.wikimedia.org/r/298456 [14:31:51] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Dumps-Generation, 10Elasticsearch: Link "current" to last dump set on cirrussearch get a 404 - https://phabricator.wikimedia.org/T138176#2453079 (10ArielGlenn) a:03ArielGlenn [14:32:32] (03CR) 10Ottomata: [C: 031] Change AQS cassandra cluster name to remove "Test" (aqs100[456]) [puppet] - 10https://gerrit.wikimedia.org/r/298456 (owner: 10Elukey) [14:32:40] (03PS1) 10Andrew Bogott: Change ram_allocation_ratio to 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/298480 [14:33:42] !log Rebuild new AQS Cassandra cluster (aqs100[456]) to remove previous testing settings (no prod traffic is served) [14:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:09] (03CR) 10Elukey: [C: 032] Change AQS cassandra cluster name to remove "Test" (aqs100[456]) [puppet] - 10https://gerrit.wikimedia.org/r/298456 (owner: 10Elukey) [14:39:50] ouch I didn't see a cdh submodule change in my commit snap [14:40:04] fixing it asap, will keep palladium puppet-merge blocked a second [14:40:05] sorry [14:42:40] (03PS1) 10Elukey: Fix not intended rollback of cdh module in my previous commit [puppet] - 10https://gerrit.wikimedia.org/r/298482 [14:43:26] (03CR) 10Elukey: [C: 032 V: 032] Fix not intended rollback of cdh module in my previous commit [puppet] - 10https://gerrit.wikimedia.org/r/298482 (owner: 10Elukey) [14:44:22] all good unblocked [14:46:43] (03PS3) 10BBlack: lvs: rate-limit more ICMP codes, lower to 1/200ms [puppet] - 10https://gerrit.wikimedia.org/r/294467 (https://phabricator.wikimedia.org/T136939) (owner: 10Faidon Liambotis) [14:48:37] jouncebot: next [14:48:38] In 0 hour(s) and 11 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160712T1500) [14:49:10] (03PS1) 10Jgreen: 
Remove gratuitous donate.wiki[mp]edia.org SPF records. [dns] - 10https://gerrit.wikimedia.org/r/298484 (https://phabricator.wikimedia.org/T135410) [14:49:44] (03CR) 10jenkins-bot: [V: 04-1] Remove gratuitous donate.wiki[mp]edia.org SPF records. [dns] - 10https://gerrit.wikimedia.org/r/298484 (https://phabricator.wikimedia.org/T135410) (owner: 10Jgreen) [14:51:33] (03CR) 10BBlack: [C: 032] lvs: rate-limit more ICMP codes, lower to 1/200ms [puppet] - 10https://gerrit.wikimedia.org/r/294467 (https://phabricator.wikimedia.org/T136939) (owner: 10Faidon Liambotis) [14:52:05] (03PS2) 10Jgreen: Remove gratuitous donate.wiki[mp]edia.org SPF records. [dns] - 10https://gerrit.wikimedia.org/r/298484 (https://phabricator.wikimedia.org/T135410) [14:52:32] (03CR) 10jenkins-bot: [V: 04-1] Remove gratuitous donate.wiki[mp]edia.org SPF records. [dns] - 10https://gerrit.wikimedia.org/r/298484 (https://phabricator.wikimedia.org/T135410) (owner: 10Jgreen) [14:56:02] (03CR) 10BBlack: [C: 032 V: 032] Add nginx.org ubsan shift patches [software/nginx] (wmf-1.11.2) - 10https://gerrit.wikimedia.org/r/298350 (owner: 10BBlack) [14:56:12] (03CR) 10BBlack: [C: 032 V: 032] Add Cloudflare TLS dynamic record sizing [software/nginx] (wmf-1.11.2) - 10https://gerrit.wikimedia.org/r/298351 (owner: 10BBlack) [14:56:23] (03CR) 10BBlack: [C: 032 V: 032] nginx (1.11.2-1+wmf1) jessie; urgency=medium [software/nginx] (wmf-1.11.2) - 10https://gerrit.wikimedia.org/r/298352 (owner: 10BBlack) [14:58:17] RECOVERY - Disk space on ms-be3001 is OK: DISK OK [14:58:26] !log upgrading nginx to 1.11.2-1+wmf1 on cache_maps [14:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:20] bblack: btw, had a chat with the openssl maintainer at debconf [14:59:33] paravoid: any news on 1.1.0? [14:59:51] that's in progress, it just needs fixes all across the archive (including e.g. 
hhvm) [15:00:04] thcipriani and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160712T1500). Please do the needful. [15:00:04] Urbanecm and jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] Around [15:00:13] (03CR) 10Filippo Giunchedi: [C: 031] package_builder: install WMF lintian profile file [puppet] - 10https://gerrit.wikimedia.org/r/298286 (owner: 10Ema) [15:00:14] the plan is for stretch to be released with 1.1.0 [15:00:49] 0/ [15:01:05] zeljkof: and I are pairing for SWAT today. [15:01:05] (03PS1) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) [15:01:16] bblack: see http://bugs.debian.org/827061 [15:01:20] but anyway [15:01:21] in other news [15:01:30] thcipriani, if you want, I can deploy the wmf9 patch myself, since wmf9 is not currently deployed anyway. [15:01:34] (03PS2) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) [15:01:47] bblack: libssl 1.0.2 was uploaded to jessie-backports [15:01:47] I don't know why it didn't ping me. [15:01:58] bblack: (try an "apt-cache policy libssl1.0.0" on a cp* box) [15:02:13] bblack: so we could switch to that [15:02:23] oh nice [15:02:28] the real upstream jessie-backports? 
[15:02:30] yes [15:02:39] and I also chatted with the nginx maintainer (Christos) [15:02:43] and let him know that [15:02:49] (03CR) 10jenkins-bot: [V: 04-1] Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) (owner: 10Addshore) [15:02:54] so he's thinking of uploading an nginx with a libssl 1.0.2 dependency and thus H2 [15:02:58] into jessie-backports [15:03:12] matt_flaschen: I can deploy that as part of SWAT, is there a link? [15:03:17] we still need the cloudflare patch and I just you pushed a ubsan patch though [15:03:24] yeah, they've (nginx debian maint) also started doing experimental 1.11 packages too [15:03:25] so that doesn't help us all that much I suppose :) [15:03:32] (03PS3) 10Thcipriani: Logo update for trwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298441 (https://phabricator.wikimedia.org/T140015) (owner: 10Urbanecm) [15:03:41] (03PS3) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) [15:03:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298441 (https://phabricator.wikimedia.org/T140015) (owner: 10Urbanecm) [15:03:52] well, for 1.11.1 we based on debian's 1.10 work and added 1.11 ourselves (and CF patch) [15:04:12] for 1.11.2, I rebased us onto the debian experimental branch's work for 1.11.2-1~exp1, and then added ours [15:04:38] (03Merged) 10jenkins-bot: Logo update for trwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298441 (https://phabricator.wikimedia.org/T140015) (owner: 10Urbanecm) [15:04:38] I think either way, we won't easily get away from doing our own nginx package [15:04:48] thcipriani, ag, sorry, https://gerrit.wikimedia.org/r/#/c/298429/1 , will add. 
[15:05:03] but getting openssl from upstream will be nice, one less package for us (moritz really) to maintain locally [15:05:15] assuming they sec-patch backports quickly [15:05:22] (03PS4) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) [15:05:33] (03PS1) 10Cmjohnson: Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 [15:05:58] (03CR) 10jenkins-bot: [V: 04-1] Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 (owner: 10Cmjohnson) [15:06:48] (03PS2) 10Ema: package_builder: install WMF lintian profile file [puppet] - 10https://gerrit.wikimedia.org/r/298286 [15:06:57] RECOVERY - Disk space on ms-be3002 is OK: DISK OK [15:06:58] (03CR) 10Ema: [C: 032 V: 032] package_builder: install WMF lintian profile file [puppet] - 10https://gerrit.wikimedia.org/r/298286 (owner: 10Ema) [15:07:15] RECOVERY - Disk space on ms-be3003 is OK: DISK OK [15:07:42] bblack: jessie-backports isn't security-supported in theory [15:07:48] but Kurt said he'd maintain it there [15:07:57] cmjohnson1: Hi! [15:08:00] and Moritz could probably do the security uploads there too if he wanted :) [15:08:28] true! [15:09:11] are the ubsan patches from upstream? [15:09:15] (03PS3) 10Jgreen: Remove gratuitous donate.wiki[mp]edia.org SPF records. [dns] - 10https://gerrit.wikimedia.org/r/298484 (https://phabricator.wikimedia.org/T135410) [15:09:19] are they just backported from master or something? 
[15:09:24] !log thcipriani@tin Synchronized static/images/project-logos/trwikimedia.png: SWAT: [[gerrit:298441|Logo update for trwikimedia (T140015)]] (duration: 00m 33s) [15:09:25] T140015: tr.wikimedia.org logo update - https://phabricator.wikimedia.org/T140015 [15:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:49] paravoid: yeah I cherry-picked them from hg.nginx.org master, they're after the latest (1.11.2) release [15:10:00] cool [15:10:11] ^ Urbanecm check logo please, should be purged. [15:10:11] so next release we can kill them [15:10:23] so it's just the adaptive TLS that we need now right? [15:10:31] I wonder if that could be something that the Debian maintainer could add ;) [15:10:34] 06Operations, 10ops-eqiad: ms-be1012.eqiad.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T140101#2453316 (10fgiunchedi) p:05Triage>03Normal [15:11:57] thcipriani, Logo is working but there is a blue space around it. Do you have an idea how to remove it? [15:12:23] Urbanecm: blue space? I'm not seeing that. I do see a non-transparent background. [15:12:28] is the logo the correct size? [15:12:28] paravoid: yeah I donno, putting on a more-conservative hat, it's controversial how the default tuning works and whether it's a net benefit over, say, using a fixed 4K record size for the usual case... [15:12:59] paravoid: If I use my crystal ball, I expect sometime in the next few months nginx.org 1.11.x will gain dynamic record sizing, but they'll write their own fresh patch and make it work better/different. [15:13:11] * Urbanecm preparing a screenshot [15:14:15] paravoid: so adding that patch now in an official deb, will probably cause headaches down the road, because maybe the syntax for the options changes in a point release in a way that's not forward or backward compat for the restart. 
[15:14:54] I can see it, see http://urbanecm.8u.cz/wikipedia/logoTrwikimedia.png [15:15:07] 06Operations, 10Incident-20151216-Labs-NFS, 06Labs: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#2453330 (10fgiunchedi) adding labs too, ATM this is the situation kernel-wise: ``` $ ssh labstore1001.eqiad.wmnet uname -a Linux labstore1001 3.1... [15:15:22] Urbanecm: that's what I see as well. Is that acceptable? Or should we revert? [15:15:48] I think it should be removed but I have no idea how to do it. [15:16:50] (03PS5) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) [15:17:11] Urbanecm: I am going to revert for now, the transparent background should be part of the png. I'm not sure how best to do that either, a designer should probably be poked. [15:17:17] p858snake: I don't think a size do something with a background. [15:17:31] (03CR) 10Addshore: [C: 031] Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) (owner: 10Addshore) [15:18:08] thcipriani: Okay. Maybe the background is a part of SVG which I converted to PNG of certain size. 
[15:18:26] !log thcipriani@tin Synchronized static/images/project-logos/trwikimedia.png: SWAT: Revert [[gerrit:298441|Logo update for trwikimedia (T140015)]] (duration: 00m 29s) [15:18:27] T140015: tr.wikimedia.org logo update - https://phabricator.wikimedia.org/T140015 [15:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:36] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:02] (03PS3) 10Alex Monk: deployment-prep: Point upload cache at swift, fix rewrite.py to use beta.wmflabs.org domains [puppet] - 10https://gerrit.wikimedia.org/r/298297 (https://phabricator.wikimedia.org/T64835) [15:19:04] (03PS2) 10Thcipriani: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298476 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:19:41] thcipriani, I'll ask them if they want it this way. If not, I'll try to find out how to remove the background from the logo. [15:19:52] Urbanecm: sounds good, thank you. [15:20:36] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:20:40] thcipriani, So please cancel my next patch because it depends on this one. [15:20:57] !log upgrading nginx to 1.11.2-1+wmf1 on all caches [15:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:03] Urbanecm: yup, I will make a note. 
[15:21:11] (03PS3) 10Faidon Liambotis: base: remove ioscheduler setting from non-augeas codepath [puppet] - 10https://gerrit.wikimedia.org/r/296727 [15:21:13] (03PS3) 10Faidon Liambotis: labstore: un-hieraize elevator/ioscheduler boot-setting [puppet] - 10https://gerrit.wikimedia.org/r/296731 [15:21:15] (03PS3) 10Faidon Liambotis: cache: un-hieraize tcpmhash_entries boot setting [puppet] - 10https://gerrit.wikimedia.org/r/296730 [15:21:17] (03PS3) 10Faidon Liambotis: Create a new grub module [puppet] - 10https://gerrit.wikimedia.org/r/296729 [15:21:19] (03PS2) 10Faidon Liambotis: base: reenable augeas codepath on trustys [puppet] - 10https://gerrit.wikimedia.org/r/296728 [15:21:19] thcipriani, thanks [15:21:23] (03PS3) 10Faidon Liambotis: mediawiki: un-hieraize cgroup_enable boot-settings [puppet] - 10https://gerrit.wikimedia.org/r/296732 [15:21:25] (03PS1) 10Faidon Liambotis: base: do not include grub in Labs Ubuntus [puppet] - 10https://gerrit.wikimedia.org/r/298490 [15:21:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298476 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:22:51] hmm, zuul's moving a little slowly... 
[15:22:57] (03PS2) 10BBlack: Insecure POST: 20% fail for labs, 100% for external [puppet] - 10https://gerrit.wikimedia.org/r/298336 (https://phabricator.wikimedia.org/T136674) [15:23:46] (03CR) 10Giuseppe Lavagetto: [C: 031] base: do not include grub in Labs Ubuntus [puppet] - 10https://gerrit.wikimedia.org/r/298490 (owner: 10Faidon Liambotis) [15:24:13] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298476 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:24:32] (03CR) 10Giuseppe Lavagetto: [C: 031] base: reenable augeas codepath on trustys [puppet] - 10https://gerrit.wikimedia.org/r/296728 (owner: 10Faidon Liambotis) [15:25:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [15:25:40] (03PS1) 10Faidon Liambotis: Revert "spf records for wikipedia.org, see T135410" [dns] - 10https://gerrit.wikimedia.org/r/298492 [15:26:02] (03CR) 10Faidon Liambotis: [C: 032] Revert "spf records for wikipedia.org, see T135410" [dns] - 10https://gerrit.wikimedia.org/r/298492 (owner: 10Faidon Liambotis) [15:26:16] (03PS1) 10Thcipriani: Revert "Logo update for trwikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298493 [15:26:17] elukey: what's up? [15:26:24] (03CR) 10BBlack: [C: 032 V: 032] Insecure POST: 20% fail for labs, 100% for external [puppet] - 10https://gerrit.wikimedia.org/r/298336 (https://phabricator.wikimedia.org/T136674) (owner: 10BBlack) [15:26:30] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298493 (owner: 10Thcipriani) [15:27:01] (03PS1) 10Aklapper: Allow aklapper to delete files in Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/298494 [15:27:06] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[15:27:40] (03Merged) 10jenkins-bot: Revert "Logo update for trwikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298493 (owner: 10Thcipriani) [15:28:23] (03CR) 10Aklapper: "Usecase: https://phabricator.wikimedia.org/F4224812" [puppet] - 10https://gerrit.wikimedia.org/r/298494 (owner: 10Aklapper) [15:28:33] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:298476|Bumping portals to master (T128546)]] (duration: 00m 29s) [15:28:34] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [15:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:42] (03PS2) 10Andrew Bogott: Change ram_allocation_ratio to 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/298480 [15:28:52] (03PS2) 10Faidon Liambotis: Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 (owner: 10Cmjohnson) [15:29:03] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:298476|Bumping portals to master (T128546)]] (duration: 00m 29s) [15:29:04] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [15:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:10] ^ jan_drewniak check please [15:29:55] thcipriani: looks good, thanks! [15:29:55] (03CR) 10jenkins-bot: [V: 04-1] Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 (owner: 10Cmjohnson) [15:30:05] jan_drewniak: cool, thanks for checking! 
[15:33:34] !log thcipriani@tin Synchronized php-1.28.0-wmf.9/extensions/Echo/includes/ForeignWikiRequest.php: SWAT: [[gerrit:298429|ForeignWikiRequest: Bail early for non-global users (T119736)]] (duration: 00m 31s) [15:33:35] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [15:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:42] ^ matt_flaschen sync'd! [15:34:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Seems good, I just have a doubt about the glob parameter of grub::bootparam." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [15:34:23] thcipriani, thanks. [15:37:22] cmjohnson1: do you have time to chat about analytics1049 later on? [15:37:24] :) [15:37:25] T137273 [15:37:25] T137273: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273 [15:37:34] --^ [15:37:39] do you know the slot number? [15:37:45] I can swap it now [15:39:13] 06Operations: graceful of apaches randomly fails on check-time - https://phabricator.wikimedia.org/T83275#2453467 (10fgiunchedi) [15:39:31] 06Operations: graceful of apaches randomly fails on check-time - https://phabricator.wikimedia.org/T83275#911809 (10fgiunchedi) 05Open>03Invalid I don't think it is, tentatively resolving [15:40:14] 06Operations, 07Graphite, 13Patch-For-Review, 15User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#2453476 (10Addshore) [15:41:13] (03PS1) 10BBlack: cache perf: remove vm compaction cron [puppet] - 10https://gerrit.wikimedia.org/r/298499 [15:42:34] (03CR) 10BBlack: [C: 032] cache perf: remove vm compaction cron [puppet] - 10https://gerrit.wikimedia.org/r/298499 (owner: 10BBlack) [15:44:43] greg-g, are you going to send out an email about the revert to wmf.8? 
[15:44:48] 06Operations: Come up with key performance indicators (KPIs) - https://phabricator.wikimedia.org/T784#2453488 (10fgiunchedi) 05Open>03Invalid tentatively closing as invalid, looks like we are not going to do anything with it [15:44:49] !log cache nodes: salt manual removal of vm compaction cron via sed ( https://gerrit.wikimedia.org/r/298499 ) [15:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:57] (03PS1) 10Jgreen: SPF record for wikipedia.org and domains sharing that zonefile. [dns] - 10https://gerrit.wikimedia.org/r/298500 (https://phabricator.wikimedia.org/T135410) [15:46:19] cmjohnson1: nope, checking if I can grab it from the mgmt console [15:48:20] cmjohnson1: we know it isn't slot 1 :p [15:48:23] 06Operations: graphite2001 bios config issue - https://phabricator.wikimedia.org/T100959#2453502 (10fgiunchedi) 05Open>03declined I've seen `error: diskfilter writes are not supported.` before though it is harmless afaict. I can see graphite2001's console so that part is working. [15:48:26] the machine won't boot back up [15:48:30] so we can't really tell [15:48:53] PROBLEM - MariaDB Slave SQL: x1 on db1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:48:57] elukey and ottomata: I am not sure who or how but all your disks are gone [15:49:28] it won't boot because the config is missing..needs to be reinstalled [15:49:56] :/ [15:50:00] x1 master down? 
[15:50:54] RECOVERY - MariaDB Slave SQL: x1 on db1031 is OK: OK slave_sql_state Slave_SQL_Running: Yes [15:51:02] jynus: might be transient/icinga overloaded [15:51:06] ah there you go [15:51:12] no [15:51:13] <_joe_> godog: nope it was a real issue [15:51:14] it is overloaded [15:51:32] indeed [15:51:35] I can login on port 3307, but it does not respond [15:52:06] elukey: during post you can ctrl-r or esc-shift-R at the RAID prompt and rebuild the raid [15:52:06] cx_translations has broken it [15:52:19] 10000 connections in Update state [15:52:30] ContentTranslation\Translation::update [15:52:34] cmjohnson1: but there is no raid.. I think that those are single disk RAID0 [15:52:43] 06Operations, 06Discovery, 10Wikimedia-Logstash, 03Discovery-Search-Sprint, and 2 others: [EPIC] Upgrade elasticsearch cluster supporting logging to 2.3 - https://phabricator.wikimedia.org/T136001#2453582 (10bd808) Applying the default mapping change helped, but we still have several conflicting mappings i... [15:53:27] cmjohnson1: we have this weird config of single drives configured in RAID-0 array of one virtual drive each [15:53:29] yes and no...iirc the h/w setup each disk is setup to be jbod but ottomata would remember better [15:53:51] can we disable this plugin? [15:53:58] it is all done through the raid controller (elukey) [15:55:39] ostriches, can we disable content translation from all wikis- echo, flow and everything else will be broken unless we do it? [15:55:51] cmjohnson1: would I be able to see the failed disk slot in the configuration utility? [15:56:00] (the one that prompts during boot) [15:56:14] ostriches: I know nothing about content translation... [15:56:22] no, no disk presented as failed it only showed no disk configuration present [15:56:38] (03CR) 10BryanDavis: [C: 032] Factor out deletion of objects & waiting for pods [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298446 (owner: 10Yuvipanda) [15:56:50] Why did I say that to myself? 
[15:56:57] jynus: I know nothing about content translation. [15:57:04] cmjohnson1: but I am sure that one disk failed.. the is a Foreign one in PD Mgmt [15:57:08] does anyone here know something about mediawiki? [15:57:08] (03PS1) 10Ema: Revert "cache_upload: hack around a network load problem..." [puppet] - 10https://gerrit.wikimedia.org/r/298502 [15:57:10] (03Merged) 10jenkins-bot: Factor out deletion of objects & waiting for pods [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298446 (owner: 10Yuvipanda) [15:57:21] jynus: I know lots about MW [15:57:32] (just zilch about content translation, which is an extension) [15:57:36] (03CR) 10BryanDavis: [C: 032] Fix stupid logic errors in starting/stopping [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298454 (owner: 10Yuvipanda) [15:57:37] elukey [15:57:41] let me drive for a min [15:58:00] sure better :) [15:58:05] mgmt is yours [15:58:07] (03Merged) 10jenkins-bot: Fix stupid logic errors in starting/stopping [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298454 (owner: 10Yuvipanda) [15:58:48] jynus: What are you wanting to do? [15:58:54] (03CR) 10BBlack: [C: 031] Revert "cache_upload: hack around a network load problem..." [puppet] - 10https://gerrit.wikimedia.org/r/298502 (owner: 10Ema) [15:59:25] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 1 failures [15:59:49] https://phabricator.wikimedia.org/T140123 [15:59:54] there is an outage going in [15:59:55] cmjohnson1: those nodes are single disk RAID0 [16:00:00] for each disk [16:00:03] the kafka ones are JBOD [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160712T1600). [16:00:04] Hashar and Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:07] i have no idea why they are like that [16:00:09] they just are :/ [16:00:23] matt_flaschen: ori said he was last night, I just go on my computer, what's the status with the patches/fixes? [16:00:36] (03CR) 10Ema: [C: 032 V: 032] Revert "cache_upload: hack around a network load problem..." [puppet] - 10https://gerrit.wikimedia.org/r/298502 (owner: 10Ema) [16:00:53] jynus: ok yeah I got that. Disabling the extension? [16:01:04] matt_flaschen: I'm going into a meeting. please treat this issue as UBN and get the help you need. [16:01:34] elukey: the disk is ready [16:01:44] should be able to add from this screen [16:02:04] I'll wait for puppet swat until we understand the x1 issue better [16:02:13] https://usercontent.irccloud-cdn.com/file/qoCfhVBd/ [16:02:34] (03PS1) 10BBlack: gdnsd config: zones_default_ttl = 3600 [puppet] - 10https://gerrit.wikimedia.org/r/298504 [16:02:38] greg-g, we're back on wmf.8 everywhere, so at least the Echo cause should be eliminated just from that. RoanKattouw wrote a patch which is merged to master and now deployed to wmf.9. [16:02:39] people are saying to me fa.wp is down [16:03:02] greg-g, I think there are two underlying causes, one Echo one separate, based on the dates. [16:03:02] jynus: We can disable it. [16:03:24] (03PS3) 10Andrew Bogott: Change ram_allocation_ratio to 1.2 [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) [16:03:26] (03PS1) 10Andrew Bogott: cold-migrate: update 'node' db setting as well as 'host'. [puppet] - 10https://gerrit.wikimedia.org/r/298507 [16:03:28] please disable it everywhere, ostriches [16:03:28] (03PS1) 10Andrew Bogott: Lower disk overcommmit ratio to 1.5. 
[puppet] - 10https://gerrit.wikimedia.org/r/298508 (https://phabricator.wikimedia.org/T140122) [16:03:38] or anyone that can do [16:03:46] jynus: {{doing}} [16:03:48] Uno momento [16:04:23] (03PS1) 10Chad: Disable content translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298510 (https://phabricator.wikimedia.org/T140123) [16:04:42] Nikerabbit: kart_: santhosh: ^ [16:04:53] elukey: did you recover this? [16:05:00] MatmaRex: Thanks, that was my next ping. [16:05:05] (03CR) 10Jcrespo: [C: 031] Disable content translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298510 (https://phabricator.wikimedia.org/T140123) (owner: 10Chad) [16:05:08] matt_flaschen: AIUI switching to wmf8 does not affect this, only the Echo transition flags do [16:05:08] cmjohnson1: sorry Chris I didn't get what I should do [16:05:17] (be patient please) [16:06:07] no, i am wondering if you recovered the config or rebuilt it? [16:06:38] matt_flaschen: greg-g: So I don't think the Echo cause is eliminated until my patch is backported to wmf8 [16:06:46] may have to use megacli to add the disk group back [16:07:01] RoanKattouw: {{approved}} for backporting then. [16:07:09] (03CR) 10Chad: [C: 032] Disable content translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298510 (https://phabricator.wikimedia.org/T140123) (owner: 10Chad) [16:07:30] (03CR) 10BryanDavis: Take status of pod into account as well for webservice status (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298455 (owner: 10Yuvipanda) [16:07:49] ostriches: interesting commit description [16:08:16] cmjohnson1: didn't do anything up to now [16:08:20] Luke081515: It does what it says :) [16:08:23] okay [16:08:48] ostriches: Thanks. matt_flaschen could you do the backport then? [16:08:57] ostriches: I mean the description, not the title :D. What do you want to test? [16:09:16] RoanKattouw, yeah. Sorry, I thought the flags weren't checked in wmf8. 
[16:09:33] Luke081515: Nothing for me personally. Just figured it'd be best to leave it on there in case there needs to be any testing for root cause analysis. [16:09:40] umm what? [16:10:02] Nikerabbit: tldr: Content Translation is bringing x1 down. jynus wants it disabled for the time being. [16:10:14] (which I am in the process of doing) [16:10:25] hm, ok [16:10:26] Nikerabbit, content translation was creating a full outage on all wikis [16:10:46] (a partial outage) [16:10:52] o.O [16:10:56] what a nice coincidence that this happens when we are celebrating our 100000th translation. [16:11:00] something I can help with? [16:11:11] fix https://phabricator.wikimedia.org/T140123#2453720 [16:11:16] then we can reenable [16:11:43] I can share the user in private if it helps debugging [16:12:01] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Disable content translation, outage right now (duration: 00m 29s) [16:12:03] is there a stacktrace? [16:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:12] Nikerabbit, jynus ^^^^ [16:12:22] Nikerabbit, it was not a mediawiki error [16:12:30] (I left it enabled on testwiki for now, for testing) [16:12:36] (Figured that might help you Nikerabbit) [16:12:40] (03CR) 10Alex Monk: [C: 031] "what about update, abandon and touch?" [puppet] - 10https://gerrit.wikimedia.org/r/298280 (owner: 10Andrew Bogott) [16:13:00] several queries were running for 1419 seconds [16:13:17] several == thousands [16:13:29] https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=da58dc0b53791be4d3bda4ef4224b952&host=family%3Adb1031&user=&schema=&hours=1 [16:13:45] was there a slow query or did something just get stuck, causing all following queries to pile up? [16:13:58] greg-g, will deploy the backport to wmf8 now. [16:14:00] (03CR) 10BryanDavis: [C: 031] "A nit inline about using long command options when available to improve readability." 
(032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298459 (owner: 10Yuvipanda) [16:14:31] (03PS1) 10BBlack: remove default ttl/origin from top of all zonefiles [dns] - 10https://gerrit.wikimedia.org/r/298513 [16:14:33] (03PS1) 10BBlack: remove needless {{ zonename }} templating [dns] - 10https://gerrit.wikimedia.org/r/298514 [16:14:35] (03PS1) 10BBlack: wmnet: explicit full $ORIGIN statements [dns] - 10https://gerrit.wikimedia.org/r/298515 [16:14:35] elukey: an1049 is up [16:14:37] (03CR) 10BryanDavis: [C: 031] Add nodejs webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298464 (owner: 10Yuvipanda) [16:14:58] (03Abandoned) 10Jgreen: Remove gratuitous donate.wiki[mp]edia.org SPF records. [dns] - 10https://gerrit.wikimedia.org/r/298484 (https://phabricator.wikimedia.org/T135410) (owner: 10Jgreen) [16:15:18] as far as I can see the queries were running [16:15:48] (03CR) 10BryanDavis: [C: 04-1] Permit doing webservice shell for k8s with a running ge job (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298473 (owner: 10Yuvipanda) [16:15:56] cmjohnson1: thankssssss [16:16:17] all started to fail when we reached max_connections [16:16:57] causing 85451 requests to fail from echo, flow, etc. (everything on x1) [16:17:46] That looks like an UPDATE with a PK condition, how could that possibly be slow? [16:18:04] I had another outage ongoing, you can blame me for taking the decision on disabling it (I think it was the right thing to do) [16:18:12] but please follow up [16:18:37] as far as I can see from the code, we are not calling that function in a loop, so the only way for that to happen would be for someone to hit the API with thousands of requests [16:18:38] thanks, ostriches for the help, much appreciated for your fast response [16:18:43] is it possible to confirm if this happened? [16:18:48] Thanks matt_flaschen . 
The flags are controlled in wmf-config and the code respecting them was merged early so it could be tested [16:19:12] (03PS2) 10Andrew Bogott: cold-migrate: update 'node' db setting as well as 'host'. [puppet] - 10https://gerrit.wikimedia.org/r/298507 [16:19:13] Nikerabbit, that probably can be answer by anyone- allow me to attend the other ongoing issues [16:19:31] (I am not a blocker to reenable the extension, but I would beg for some investigation first) [16:19:43] RoanKattouw, yeah, I should have checked when they were added. I misunderstood that the wmf.8 revert was enough to work around the problem, and we just had to fix it before going back to wmf.9. [16:20:08] well, would someone help me to answer that question? [16:20:47] Ironically, I doubt if the revert helped at all [16:20:55] But the logs will tell you that [16:21:31] RoanKattouw, I could kill the queries knowing that they will not happen again [16:21:53] oh, sorry, RoanKattouw, wrong person [16:21:57] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2453855 (10Cmjohnson) It appears that a disk was in a foregin cfg mode. Cleared the foreign cfg and cache. Added the disk group back. [16:22:43] cmjohnson1: just to understand - did you replace the disk right? [16:22:50] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2453876 (10Cmjohnson) Return shipment of the first disk FEDEX 9611918 2393026 70283562 UPS 9202 3946 5301 2421 0335 48 [16:23:21] cmjohnson1: ah no just seen the phab task upgrades [16:23:24] *updates [16:23:38] elukey: no disk was replaced....the error may very well have been from the last disk I replaced. 
Appears to have been in a foreign state [16:23:40] (03PS20) 10Filippo Giunchedi: contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10Coren) [16:24:24] cmjohnson1: yep yep now I got why you asked me if I set anything previously [16:24:27] okok thanks :) [16:24:45] matt_flaschen: email sent re revert [16:25:04] (03CR) 10Alex Monk: "Assuming we only ever want this to work in labs and not labtest (codfw), should be fine" [puppet] - 10https://gerrit.wikimedia.org/r/298507 (owner: 10Andrew Bogott) [16:25:23] (03CR) 10Andrew Bogott: "Hm... I was worried about what would happen when a projectadmin tries to alter a domain owned by another project (e.g. renaming foo.wmflab" [puppet] - 10https://gerrit.wikimedia.org/r/298280 (owner: 10Andrew Bogott) [16:25:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10Coren) [16:25:31] greg-g, I also sent one. Sorry, should have pinged you. [16:25:35] !log mattflaschen@tin Synchronized php-1.28.0-wmf.8/extensions/Echo/includes/ForeignWikiRequest.php: T119736: ForeignWikiRequest: Bail early for non-global users (duration: 00m 32s) [16:25:36] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [16:25:36] nah, s'ok [16:25:39] (03PS2) 10Andrew Bogott: Desigate policy: Allow projectadmins to manipulate domains [puppet] - 10https://gerrit.wikimedia.org/r/298280 [16:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:53] ^ grrrit-wm, RoanKattouw, deployed to wmf8 now as well. [16:26:22] grrrit-wm :) [16:26:27] ^ greg-g [16:26:57] matt_flaschen: just for me to catch up: is that a "fix" or a bandaid a la bd808's patch? 
[16:27:02] (03CR) 10Andrew Bogott: [C: 032] "eqiad.wmnet is hardcoded in plenty of other places here already :(" [puppet] - 10https://gerrit.wikimedia.org/r/298507 (owner: 10Andrew Bogott) [16:27:16] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:22] (03PS3) 10Andrew Bogott: cold-migrate: update 'node' db setting as well as 'host'. [puppet] - 10https://gerrit.wikimedia.org/r/298507 [16:28:31] jynus: it seems indeed that it was caused by many requests: https://graphite.wikimedia.org/S/Bb [16:29:29] Nikerabbit, can we put a limit, minimize it from happening again (even something as stupid as telling the user not to do it again?) [16:29:42] whatever it helps [16:29:49] then reenable [16:30:25] (my role is to fix immediate ongoing problems, the rest is up to you) [16:30:29] well, the API module does check if the user is blocked [16:30:35] Could slap some ping limiter on it. [16:30:46] greg-g, it should fix the Echo cause. [16:30:49] Or PoolCounter, to stop concurrent identical txns [16:31:21] ^ RoanKattouw [16:31:48] greg-g, I'm not certain if CentralAuth is doing the right thing here, though. [16:32:13] jynus: that's fine but I need to find someone to help me to figure out possible solutions [16:33:03] Nikerabbit, as I said, I can help, but not right now when there is another ongoing issue [16:33:25] (03PS2) 10BBlack: gdnsd config: zones_default_ttl = 3600 [puppet] - 10https://gerrit.wikimedia.org/r/298504 [16:33:35] I suppose there is can be some extra volunteers on this channel [16:34:46] ping limiter could do, but it needs configuration and a code change... is there someone here to deploy that if I write the code? [16:34:55] godog, hasharAway: how's it going? 
[16:35:03] 06Operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#2453940 (10Paladox) [16:35:05] 06Operations, 10Gerrit: Remove Java 6 from ytterbium.wikimedia.org (Gerrit production host) - https://phabricator.wikimedia.org/T103668#2453938 (10Paladox) 05Open>03declined Declining since were moving to lead as the new gerrit host then I presume ytterbium will be decommissioned. [16:35:11] Krenair: preparing dinner with kids :D [16:35:15] Krenair: I'm looking at your patch ATM, merged hasharAway's [16:35:19] ok [16:35:21] hasharAway, :D [16:35:26] godog: which one? [16:35:34] your CI one [16:35:38] jouncebot: ping [16:35:41] oh puppetswat [16:35:52] "contint: Put mysql db on tmpfs for role::ci::slave::labs" [16:36:04] Nikerabbit: Yes, I could. [16:36:23] that mysql related patch is a terrible hack really :- [16:36:54] (03CR) 10BBlack: [C: 032] gdnsd config: zones_default_ttl = 3600 [puppet] - 10https://gerrit.wikimedia.org/r/298504 (owner: 10BBlack) [16:36:55] but it at least has been proven to work on the ci slaves [16:36:57] (03PS4) 10Filippo Giunchedi: deployment-prep: Point upload cache at swift, fix rewrite.py to use beta.wmflabs.org domains [puppet] - 10https://gerrit.wikimedia.org/r/298297 (https://phabricator.wikimedia.org/T64835) (owner: 10Alex Monk) [16:37:04] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] deployment-prep: Point upload cache at swift, fix rewrite.py to use beta.wmflabs.org domains [puppet] - 10https://gerrit.wikimedia.org/r/298297 (https://phabricator.wikimedia.org/T64835) (owner: 10Alex Monk) [16:38:06] ostriches: ok few mins, I started working 12h so I am not at my brightest [16:38:07] godog: I merged yours too (puppet-merge race - I only said yes to mine, but yours were there after the question was answered) [16:38:20] bblack: oh! ok thanks [16:38:39] heh puppet-merge doesn't lock the repo -- yay [16:38:55] Nikerabbit: No worries. 
I'll review anything you put up :) [16:39:49] yeah probably what puppet-merge should do is record the HEAD SHA-1 it asks the question about, and then only merge in that SHA-1 after the question is answered, instead of the potentially-new HEAD [16:41:25] (03PS3) 10Yuvipanda: Permit doing webservice shell for k8s with a running ge job [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298473 [16:41:28] (03PS3) 10Yuvipanda: Refactor to make spawning shell/webservice similar [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298459 [16:41:30] (03PS2) 10Yuvipanda: Add nodejs webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298464 [16:41:32] (03PS3) 10Yuvipanda: Take status of pod into account as well for webservice status [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298455 [16:41:34] bblack, yes, we discussed implementing such a thing [16:41:52] but I was too lazy and said it would not happen many times [16:42:19] !log disable puppet on ms-fe* and re-enable gradually to apply https://gerrit.wikimedia.org/r/#/c/298297/ [16:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:45:21] matt_flaschen: just checking in: how do the logs look? [16:45:32] 06Operations, 13Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#2454037 (10Dzahn) @bblack Thoughts on this? 
[16:47:15] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 404 (expecting: 200): /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 404 (expecting: 200): /unique-devices/{p [16:47:15] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 404 (expecting: 200): /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 404 (expecting: 200): /unique-devices/{p [16:47:24] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2454045 (10Andrew) p:05Normal>03Lowest [16:47:57] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 404 (expecting: 200): /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 404 (expecting: 200): /unique-devices/{p [16:48:18] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2454047 (10BBlack) [16:49:11] 06Operations, 10Traffic, 10Wikimedia-Stream, 07HTTPS: stream.wikimedia.org doesn't redirect to HTTPS - https://phabricator.wikimedia.org/T137915#2454062 (10BBlack) [16:49:13] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream: HTTPS-only for 
stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2454063 (10BBlack) [16:49:22] godog: thank you for the merge :) [16:49:48] greg-g, looks good, but too early to say. Also, the Echo fix only prevents additional inconsistent-state accounts, it doesn't affect logins for prior accounts already in that state. [16:49:49] 06Operations, 10Traffic, 13Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2454068 (10BBlack) [16:49:51] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2454047 (10BBlack) [16:49:55] (03PS1) 10Nikerabbit: Add rate limiting for cxsave [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298523 (https://phabricator.wikimedia.org/T140123) [16:50:08] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2454047 (10BBlack) [16:50:10] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2454071 (10BBlack) [16:50:21] matt_flaschen: right, that still needs the maint script run. is that only runable with a list of explicit names? [16:50:24] greg-g, it looks like a slight decline, though: https://logstash.wikimedia.org/#dashboard/temp/AVXgBCVfT4MudYQNa35S [16:50:29] ouch I am checking AQS, those are not live nodes [16:50:37] greg-g: my (possibly wrong) understanding is that Echo caused user creations to be aborted halfway which CA then chokes on later. My patch should fix the former, but that just means we stop poisoning the well, already-poisoned user rows would still crash CA [16:50:43] I am sure it is aqs not being loaded [16:50:50] greg-g, my understanding is it can be run for all users, just slower. 
[16:51:02] * greg-g nods [16:51:19] hasharAway: np, of course reviews that remove code/legacy are better heh [16:51:21] RoanKattouw: matt_flaschen so, do we grep logs or just run for everyone? [16:51:40] Sorry I'm on my phone with low battery on 2G in a foreign country eating dinner so I'm gonna bail [16:51:42] Krenair: merged, looks good [16:51:52] thanks godog [16:51:57] !log deploying ores f472f65 to scb2001 [16:51:58] RoanKattouw: k, "enjoy" (at least more than already) vacation! [16:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:21] Krenair: np! thanks to you for working on this [16:53:52] (03PS6) 10BryanDavis: logstash: Update default mappings for Elasticsearch 2.x [puppet] - 10https://gerrit.wikimedia.org/r/298295 (https://phabricator.wikimedia.org/T136001) [16:54:06] ostriches: two patches: https://gerrit.wikimedia.org/r/#/q/topic:cxsave,n,z [16:54:13] Just saw, looking [16:54:13] greg-g, I think all. anomie, do you want to run this script? I could, but I haven't used it before. [16:54:28] matt_flaschen: that graph does look encouraging [16:54:35] * greg-g takes a deep breath [16:55:12] Nikerabbit: lgtm, let's merge this. [16:55:19] everything looks good, going to the whole scb [16:55:24] (03PS2) 10BryanDavis: logstash: Remove all _* fields from gelf records [puppet] - 10https://gerrit.wikimedia.org/r/298382 [16:55:31] greg-g, make sure you only look at the times that are in the past (it shows future for some reason as well). It does still look pretty good now. [16:55:41] matt_flaschen: I haven't either. I think legoktm usually does it. 
[16:55:41] oh right [16:55:45] (03CR) 10Chad: [C: 032] Add rate limiting for cxsave [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298523 (https://phabricator.wikimedia.org/T140123) (owner: 10Nikerabbit) [16:56:05] * greg-g changed view to last hour [16:56:09] !log deploying ores f472f65 to scb [16:56:12] Nikerabbit: After the extension change lands, we'll backport to wmf.8 [16:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:19] I'll go ahead and sync the config change now so it's ready [16:56:28] there's still some post deploy, what could cause that? [16:56:33] ostriches: tyvm [16:56:42] (03CR) 10BryanDavis: "PS2 was a manual rebase to detach this patch from the unrelated two patch chain that it was based on." [puppet] - 10https://gerrit.wikimedia.org/r/298382 (owner: 10BryanDavis) [16:56:57] greg-g, link? Well, first, as we discussed it doesn't fix existing accounts, second, there are clearly other causes besides Echo, see my latest post to task. [16:57:10] (Link for dashboard with better view, you can use share link in top right). 
[16:57:14] (03Merged) 10jenkins-bot: Add rate limiting for cxsave [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298523 (https://phabricator.wikimedia.org/T140123) (owner: 10Nikerabbit) [16:57:25] matt_flaschen: https://logstash.wikimedia.org/#dashboard/temp/AVXgCpwEc8qLrUhXtIji [16:58:06] (03PS1) 10BBlack: Remove old rcstream public LVS config [puppet] - 10https://gerrit.wikimedia.org/r/298525 (https://phabricator.wikimedia.org/T134871) [16:58:46] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: prep pinglimiter config for content translation (duration: 00m 33s) [16:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160712T1700). Please do the needful. [17:01:07] no parsoid deploy today. [17:03:23] matt_flaschen: thank you, btw, for your response here [17:03:29] where "here" == this issue [17:03:45] 06Operations: long-running root console sessions - https://phabricator.wikimedia.org/T105869#2454159 (10fgiunchedi) I've ran the audit again (almost a year later, to the day!) and just three hosts have root sessions ``` root@neodymium:~# grep ttyS login_console_audit.log {'restbase2004.codfw.wmnet': 'root... [17:10:00] 06Operations, 10Ops-Access-Requests: Add marktraceur to statistics-privatedata-users for access to stat1002 - https://phabricator.wikimedia.org/T140132#2454200 (10MarkTraceur) [17:12:53] matt_flaschen: anomie so, who's goign to run that maint script, we need it run so we can tell actual new rate of failure [17:14:31] greg-g, looks like me. I'm in a meeting, will do so afterwards. [17:15:05] godog, is it possible to import files via the swift cli? [17:15:21] matt_flaschen: at :30 or :00? [17:15:38] ooh "swift upload" [17:15:45] Krenair: yeah, was about to say [17:16:38] greg-g, 30, at latest. 
[17:17:04] (03CR) 10BryanDavis: "The mapping from PS5 was made live on the production cluster before the logstash-2016.07.12 index was auto-created. This worked well *exce" [puppet] - 10https://gerrit.wikimedia.org/r/298295 (https://phabricator.wikimedia.org/T136001) (owner: 10BryanDavis) [17:18:12] Magic. [17:23:54] (03PS1) 10BBlack: Remove stream-lb.eqiad hostname [dns] - 10https://gerrit.wikimedia.org/r/298530 (https://phabricator.wikimedia.org/T134871) [17:24:59] 06Operations, 10RESTBase, 06Services, 10Wikimedia-Site-requests: Index page https://wikimedia.org/api/ is broken / RESTBase not discoverable - https://phabricator.wikimedia.org/T138848#2454262 (10Krinkle) [17:30:20] 06Operations, 10Gerrit, 10Mail, 07Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2454295 (10Paladox) 05stalled>03Open Re opening since gerrit 2.12 is now moving on. Please go back to stalled if I am wrong or I should not change the status please. [17:30:20] greg-g, anomie, okay, done my meeting. Will start it now. [17:31:27] (03CR) 10Chad: "Works for me if we'd rather do them on-wiki (I'm not a huge fan of the on-wiki hiera, but I'd rather consolidate them *somewhere* rather t" [puppet] - 10https://gerrit.wikimedia.org/r/296809 (owner: 10Chad) [17:32:04] godog, would it be possible for you to give me a full list of production's swift containers? I'm not convinced I got everything created in beta [17:33:37] 06Operations, 10Gerrit, 10Mail, 07Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2454300 (10demon) What does 2.12 have to do with it? [17:34:09] in particular global-data [17:34:44] Lourdes is a piece of rotten garbage born in the year 1. She is an old woman who has very fat legs and wears little skirts [17:34:46] Lourdes is a piece of rotten garbage born in the year 1. She is an old woman who has very fat legs and wears little skirts [17:35:31] WTF?
[17:36:05] they spammed that in -ops too [17:36:18] we are in -ops [17:36:25] we are in -operations [17:36:30] -ops is the irc ops [17:36:30] oh [17:36:34] sorry [17:37:22] !log out of band ALTER TABLE recentchanges ADD KEY `name_type_patrolled_timestamp` on db1054 T140108 [17:37:24] T140108: ApiQueryRecentChanges::run is spiking, nuking API servers - https://phabricator.wikimedia.org/T140108 [17:37:24] (03CR) 10Chad: "(Responding to all the inlines at once): The point with these is to auto-provision based off of host names. I know we can apply them to a " [puppet] - 10https://gerrit.wikimedia.org/r/296809 (owner: 10Chad) [17:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:56] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [17:37:57] Krenair: yup, though setZoneAccess will create all needed containers, if the filebackend config matches I'm quite sure it'll be the same [17:38:43] Well that's the thing [17:39:06] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [17:39:06] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [17:41:09] Krenair: ok I'm fetching the list [17:41:26] (03CR) 10Yuvipanda: "@Chad yup, we're integrating that kind of functionality (prefix based class application) into horizon too" [puppet] - 10https://gerrit.wikimedia.org/r/296809 (owner: 10Chad) [17:42:56] setZoneAccess certainly appears to do all the local containers for the wiki you run it on [17:42:58] but I'm not sure it accounts for global-data [17:43:00] AaronSchulz might know [17:43:02] thanks [17:43:11] (03CR) 10Ottomata: Move stats::wmde cron files to analytics/wmde/scripts repo (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) (owner: 10Addshore) [17:43:51] (03PS1) 10BBlack: Revert "varnishxcache: support new err/bug outputs" [puppet] - 10https://gerrit.wikimedia.org/r/298532 [17:43:53] (03PS1) 
10BBlack: varnishxcache: emit unknown, miss, pass, drop misspass [puppet] - 10https://gerrit.wikimedia.org/r/298533 [17:44:06] PROBLEM - very high load average likely xfs on ms-be3004 is CRITICAL: CRITICAL - load average: 103.88, 101.00, 93.72 [17:44:19] sad_trombone.wav [17:44:48] !log demon@tin Synchronized php-1.28.0-wmf.8/extensions/ContentTranslation/: ping limiter fixes (duration: 00m 29s) [17:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:03] Nikerabbit: ^^^^ [17:46:03] ostriches: okay, so CX can now be re-enabled? [17:46:15] I think so [17:46:32] (03CR) 10BBlack: [C: 032] Revert "varnishxcache: support new err/bug outputs" [puppet] - 10https://gerrit.wikimedia.org/r/298532 (owner: 10BBlack) [17:46:41] jynus: Any objections to us turning content translation back on? Nikerabbit added a ping limiter to it which should limit the rate at which people can hit the module. [17:46:56] (And we can always tighten/loosen the config values as needed later if the original values prove unworkable) [17:47:18] no objections [17:47:28] as I said, it was an emergency action [17:47:43] Yeah, just wanted to double check :) [17:47:46] Ok, let's do it then [17:48:06] RECOVERY - very high load average likely xfs on ms-be3004 is OK: OK - load average: 4.88, 54.93, 77.42 [17:48:06] (03CR) 10BBlack: [C: 032] varnishxcache: emit unknown, miss, pass, drop misspass [puppet] - 10https://gerrit.wikimedia.org/r/298533 (owner: 10BBlack) [17:48:11] greg-g, anomie, RoanKattouw, script to clean up users with inconsistent state is running. 
Command is at T119736 [17:48:11] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [17:48:19] (03PS1) 10Chad: Revert "Disable content translation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298536 [17:48:24] ostriches, let's talk asyncronisly on the ticket about gerrit [17:48:31] Okie dokie [17:48:43] but I advance you: usually 1 weeks is prefered time [17:48:45] (03CR) 10Luke081515: [C: 031] Revert "Disable content translation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298536 (owner: 10Chad) [17:48:52] to warn in advance about downtime [17:49:06] (unless there is an ongoing emergency) [17:49:10] Sure, but sometimes things are Important and have to happen Sooner :) [17:49:17] no [17:49:28] emergecy== it happens now [17:49:38] cmjohnson1, elukey is that disk actually ok? was it just in a bad config mode? [17:49:39] non-emergency: we announce it properly [17:49:57] otherwise we can provoke more problems than what we solve [17:50:12] 1 week or 3 days we can negotiate it [17:50:39] Yeah I agree we should announce, I'm saying a week is further out than I'd like. This is Urgent. [17:50:43] (if not Emergency) [17:50:49] but the users comes first, and we all agreed gerrits is a vital part of our infrastructure [17:51:06] I said 1 week as a good value, not a strict limit [17:51:15] what it is more important [17:51:22] in some months maybe not, then phabricator does that too ;) [17:51:32] heya Krinkle, yt? [17:51:33] is to notify people that are going to work [17:51:37] on it [17:51:54] you asked me for a test system [17:51:57] jynus: Which is why I proposed a weekend downtime. Gerrit is *critical* and any weekday (no matter what time of day) is gonna hurt more than a Sat/Sun [17:52:04] :) [17:52:05] greg-g, it looks like it took a little over a day last time. 
See https://phabricator.wikimedia.org/T119736#1895689 (I think that is when he started, since the --delete fix was merged very shortly before that) [17:52:10] but didn't tell me at all about your plans [17:52:16] and we agreed last time [17:52:20] I had to be present [17:52:28] (on the link I sent you) [17:52:40] greg-g, so if you want to restart the train today, let me know how I can help. But I am assuming otherwise you want to wait until this is done tomorrow. [17:52:52] blugh [17:53:19] ottomata: I am [17:53:23] so please involve someone on ops. It can be me, no problem, and it can be a weekend [17:53:32] greg-g, the Echo thing should not be creating new problem users, so it's just a question of whether you want to wait until the existing problem users to be fixed to better verify that. [17:53:34] Oh yeah I def need someone :) [17:53:41] Anyway yeah I think we're mostly in agreement here, maybe just some details to work out. We'll figure it out async on the task :) [17:53:47] (03PS1) 10Andrew Bogott: Disable the UpdateInstanceInfo tab. [puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) [17:53:58] but do not expect someone to show up when you essentially sent an email on Saturday morning [17:54:00] greg-g, also, we don't know if there are other active causes, but I suspect maybe per what I posted in the task. [17:54:06] (03CR) 10Nikerabbit: [C: 031] ";)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298536 (owner: 10Chad) [17:54:24] so we can do Sunday if you want, but please announce it right now [17:54:26] matt_flaschen: yeah, logically it should be back to where we were pre-7/7, but we don't have the data. And the best we can do for identifying other causes is to wait until after this completes and re-look at the logs. [17:54:27] jynus: I know, and that was unreasonable to ask last minute. 
The timeline (was looking) was looking more compressed as of Friday night [17:54:39] But it kinda stretched out over the weekend. [17:54:43] * greg-g sighs and thinks [17:54:49] your Friday, my Saturday [17:55:02] greg-g, in the meantime, I will grep for the error (similar to bd808's backtrace, but the CentralAuth part). [17:55:11] +1 [17:55:20] Krinkle: looking to get this merged: https://gerrit.wikimedia.org/r/#/c/293628/ [17:55:32] greg-g, good news is the checkLocalUser.php is definitely fixing stuff. [17:55:35] Aaron said he doesn't self merge and to ask you :) [17:55:39] we are due to our users- scheduled downtime everybody understands [17:55:40] I put it on verbose, so it's noting users it's fixing. [17:55:46] (03CR) 10KartikMistry: [C: 031] Revert "Disable content translation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298536 (owner: 10Chad) [17:55:49] matt_flaschen: good deal [17:55:59] scheduled downtime with a few hours == outage [17:56:14] ostriches: going to +2 that patch? ^ [17:56:16] so please propose a time, and we will go ahead with it [17:56:45] ottomata: OK [17:57:18] thanks Krinkle! :) [17:58:39] (03PS2) 10Dzahn: iegreview: move role to module structure [puppet] - 10https://gerrit.wikimedia.org/r/298411 [17:58:54] (03CR) 10Dzahn: [C: 032] iegreview: move role to module structure [puppet] - 10https://gerrit.wikimedia.org/r/298411 (owner: 10Dzahn) [18:00:34] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2201391 (10Boshomi) When this work is done, protocol-relative URLs should be declared as depreca... 
[18:02:08] (03PS1) 10BBlack: Mailing list announcement link in 403 response for insecure-post [puppet] - 10https://gerrit.wikimedia.org/r/298539 (https://phabricator.wikimedia.org/T105794) [18:02:21] 06Operations, 10Gerrit, 10Mail, 07Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2454419 (10Paladox) @demon since it is fixed with gerrit. Meaning in a updated version of gerrit it is fixed [18:04:25] Luke081515: Yes one moment. [18:04:30] (03CR) 10BBlack: [C: 032] Mailing list announcement link in 403 response for insecure-post [puppet] - 10https://gerrit.wikimedia.org/r/298539 (https://phabricator.wikimedia.org/T105794) (owner: 10BBlack) [18:04:46] ok [18:05:20] (03PS2) 10Chad: Revert "Disable content translation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298536 [18:05:25] (03CR) 10Chad: [C: 032] Revert "Disable content translation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298536 (owner: 10Chad) [18:06:10] (03Merged) 10jenkins-bot: Revert "Disable content translation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298536 (owner: 10Chad) [18:06:30] greg-g, I'm still seeing new cases. Killing the CA script and investigating why. [18:06:41] greg-g: somebody should backport and deploy https://gerrit.wikimedia.org/r/#/c/298531/ to .8 and .9. I Think it will help [18:07:23] anomie: ^ I'm cool with you deploying that [18:07:32] backporting and* [18:08:38] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: turn cx back on (duration: 00m 29s) [18:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:49] Ok and CX is back now ^ [18:08:55] Luke081515, Nikerabbit, others: ^ [18:09:30] anomie: are you available to backport and deploy that? [18:09:32] thx :) [18:09:53] ostriches: hmm not working for me [18:10:13] Hmm, I just undid what I did before. [18:10:15] Caching? 
[18:10:22] greg-g: ok [18:10:22] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2454469 (10jcrespo) [18:10:23] (03CR) 10BryanDavis: [C: 031] Refactor to make spawning shell/webservice similar [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298459 (owner: 10Yuvipanda) [18:10:27] no JS error... weird [18:10:45] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2401019 (10jcrespo) Remember to revert https://gerrit.wikimedia.org/r/269447 [18:11:08] (03CR) 10BryanDavis: [C: 031] Permit doing webservice shell for k8s with a running ge job [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298473 (owner: 10Yuvipanda) [18:11:18] anomie: ty [18:11:51] ostriches: could be caching, it semi-works with debug=true [18:12:18] Ok. Caches should catch up in a bit I hope [18:12:19] AaronSchulz: yt? got another mw+event q for you [18:12:48] Nikerabbit, ostriches: WFM [18:12:53] (without debug) [18:13:06] ok, now works without debug for me [18:13:26] Ok sweetness [18:13:26] ostriches: thank you again very much [18:13:33] Yw. Thanks for the quick fix [18:14:12] Nikerabbit: just a short question, you metioned the 100000th translation. did we already passed that, or will we pass it in a few days/hours? [18:14:25] (03PS3) 10Dzahn: iegreview: move role to module structure [puppet] - 10https://gerrit.wikimedia.org/r/298411 [18:14:48] Luke081515: passed already: https://en.wikipedia.org/wiki/Special:CXStats [18:14:55] thx for the info :) [18:16:17] Luke081515: feel free to share, I believe we have a blog post etc. 
coming later [18:16:31] Nikerabbit: congrats :) [18:17:00] Nikerabbit:heh, after ctt got disabled, I wrote a post at village pump@dewiki, and now, when I will write the updated, I want to mention that too ;) [18:18:27] matt_flaschen: let me know when you have an idea of you see new cases of echo-related ones or not :) [18:18:29] (03PS1) 10BBlack: X-Cache: mark miss->hit_for_pass as pass, attempt #2 [puppet] - 10https://gerrit.wikimedia.org/r/298542 [18:20:50] (03PS1) 10Jdlrobson: Enable lazy loaded references and images on Thai wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298543 (https://phabricator.wikimedia.org/T136731) [18:20:58] (03CR) 10Dzahn: "jenkins, would you please" [puppet] - 10https://gerrit.wikimedia.org/r/298411 (owner: 10Dzahn) [18:21:11] (03CR) 10Dzahn: [V: 032] "jenkins, would you please" [puppet] - 10https://gerrit.wikimedia.org/r/298411 (owner: 10Dzahn) [18:21:30] (03PS2) 10BBlack: X-Cache: mark miss->hit_for_pass as pass, attempt #2 [puppet] - 10https://gerrit.wikimedia.org/r/298542 [18:21:49] (03CR) 10BBlack: [C: 032 V: 032] X-Cache: mark miss->hit_for_pass as pass, attempt #2 [puppet] - 10https://gerrit.wikimedia.org/r/298542 (owner: 10BBlack) [18:22:54] !log anomie@tin Synchronized php-1.28.0-wmf.8/includes/auth/AuthManager.php: Commit transaction after auto-creating a user [[gerrit:298540]] (duration: 00m 30s) [18:22:58] greg-g, the latest one I saw is Echo-related, and after the fix should have hit wmf8, so I'm working on a troubleshooting patch to Echo now. 
[18:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:06] (03PS4) 10Yuvipanda: Permit doing webservice shell for k8s with a running ge job [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298473 [18:23:08] (03PS4) 10Yuvipanda: Refactor to make spawning shell/webservice similar [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298459 [18:23:10] (03PS3) 10Yuvipanda: Add nodejs webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298464 [18:23:11] anomie, is that related? [18:23:12] (03PS4) 10Yuvipanda: Take status of pod into account as well for webservice status [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298455 [18:23:12] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2454534 (10jcrespo) I think I have fixed all slave differences between m3-master and m3-slave. Most were false positives due to 5.5 and 10 or tool limitations, but there w... [18:23:40] matt_flaschen: I thought of another potential cause of the bug besides the Echo bug. [18:24:13] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2201391 (10demon) >>! In T132521#2454408, @Boshomi wrote: > When this work is done, protocol-rel... 
[18:24:35] (03CR) 10BryanDavis: [C: 032] Take status of pod into account as well for webservice status [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298455 (owner: 10Yuvipanda) [18:24:37] (03CR) 10Dzahn: "i am wondering if we should summarize these 3 lines and allow "/srv/phab/phabricator/bin/remove destroy *" or even "/srv/phab/phabricator/" [puppet] - 10https://gerrit.wikimedia.org/r/298494 (owner: 10Aklapper) [18:24:41] !log anomie@tin Synchronized php-1.28.0-wmf.9/includes/auth/AuthManager.php: Commit transaction after auto-creating a user [[gerrit:298541]] (duration: 00m 29s) [18:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:14] matt_flaschen: In short: since there was no database commit after the auto-creation, anything in the request failing would wind up rolling back the local user addition. So adding a commit in there after the auto-creation makes sense. [18:25:20] (03CR) 10BryanDavis: [C: 032] Refactor to make spawning shell/webservice similar [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298459 (owner: 10Yuvipanda) [18:25:29] greg-g: Backport complete, FYI. 
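(Editor's note on the 18:25:14 explanation above: the rollback behaviour anomie describes can be illustrated with a minimal sketch. This is not the AuthManager code or the deployed patch; it uses Python's sqlite3 module as a stand-in database, and the table name `local_user`, the function `handle_request`, and the flag `commit_after_create` are all hypothetical names chosen for illustration. It assumes only what the message states: a row inserted without a commit is lost if anything later in the same request fails and the transaction is rolled back.)

```python
import sqlite3


def handle_request(conn, name, commit_after_create):
    """Auto-create a local user row, then hit a failure later in the same request."""
    try:
        conn.execute("INSERT INTO local_user (name) VALUES (?)", (name,))
        if commit_after_create:
            conn.commit()  # the new row survives anything that fails afterwards
        # Hypothetical stand-in for any later step in the request that throws.
        raise RuntimeError("something later in the request failed")
    except RuntimeError:
        conn.rollback()  # discards only work that was not committed yet


def user_exists(conn, name):
    row = conn.execute(
        "SELECT 1 FROM local_user WHERE name = ?", (name,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE local_user (name TEXT PRIMARY KEY)")
conn.commit()

handle_request(conn, "Alice", commit_after_create=False)  # insert is rolled back
handle_request(conn, "Bob", commit_after_create=True)     # insert is kept

print(user_exists(conn, "Alice"), user_exists(conn, "Bob"))  # prints: False True
```

(In the sketch, a failure that happens before the early commit would still lose the row, which is why a commit placed after the auto-creation only protects against failures in later steps.)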
[18:25:33] (03CR) 10BryanDavis: [C: 032] Add nodejs webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298464 (owner: 10Yuvipanda) [18:25:45] (03CR) 10BryanDavis: [C: 032] Permit doing webservice shell for k8s with a running ge job [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298473 (owner: 10Yuvipanda) [18:25:52] (03Merged) 10jenkins-bot: Take status of pod into account as well for webservice status [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298455 (owner: 10Yuvipanda) [18:26:11] (03Merged) 10jenkins-bot: Refactor to make spawning shell/webservice similar [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298459 (owner: 10Yuvipanda) [18:27:06] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:27:20] (03Merged) 10jenkins-bot: Add nodejs webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298464 (owner: 10Yuvipanda) [18:27:23] (03Merged) 10jenkins-bot: Permit doing webservice shell for k8s with a running ge job [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298473 (owner: 10Yuvipanda) [18:27:41] anomie, I think would also workaround the Echo cause (which should be fixed separately, but is still happening for some reason). [18:27:42] 06Operations, 10Phabricator: Phabricator weekly report not generated (or at least sent) - https://phabricator.wikimedia.org/T139950#2447582 (10jcrespo) As I comment on T138460#2454534, the slave is available from this very moment, after I fixed several data integrity issues. We can either enable the crons bac... 
[18:28:38] matt_flaschen: ty [18:28:45] (03PS1) 10Awight: Delist Special:CodeReview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298544 (https://phabricator.wikimedia.org/T116948) [18:29:05] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [18:29:14] matt_flaschen: It wouldn't because Echo is blowing things up during the UserSaveSettings hook call, while the new DB commit comes after that. [18:29:46] (03PS1) 10Yuvipanda: Bump debian version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298545 [18:30:05] anomie, oh, I see. [18:30:13] (specifically, Echo blows up from AuthManager.php line 1678, while this only fixes things that come after line 1700) [18:32:24] (03CR) 10Yuvipanda: [C: 032 V: 032] Bump debian version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298545 (owner: 10Yuvipanda) [18:33:26] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:33:32] bd808, is there a way to make the loggers from LoggerFactory go somewhere by default, or does every logger (Flow, Echo) always need to be configured separately? [18:34:25] they all need config. We have way too much log to try and record all channels all the time [18:34:56] we do have a "log all the things" config on testwiki and test2wiki [18:35:06] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [18:36:09] To "well actually" myself, yes there is a way to do that, but no we can't do it for prod [18:36:10] ottomata: I think the disk is fine...the error wasn't a bad disk. it was a foreign config on a disk. 
maybe the last dyisk we swapped wasn't added back correctl [18:36:41] (03PS3) 10Dzahn: wikistats: move role to module structure [puppet] - 10https://gerrit.wikimedia.org/r/298409 [18:36:57] hmmm [18:36:57] aye [18:37:00] (03CR) 10Dzahn: [C: 032] wikistats: move role to module structure [puppet] - 10https://gerrit.wikimedia.org/r/298409 (owner: 10Dzahn) [18:37:04] interesting that it ran for a while though [18:37:04] hm [18:37:07] so cmjohnson1 its back up? [18:37:18] yes [18:38:41] !log foreachwiki ../../../../home/legoktm/checkLocalUser.php --delete=1 --verbose=1 on terbium [18:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:14] 06Operations, 06Commons, 10media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2454595 (10kaldari) [18:39:32] Nikerabbit, I am not sure if it is needed 100%, but it would be nice to have a https://wikitech.wikimedia.org/wiki/Incident_documentation [18:41:33] jynus: I'll bring it up in tomorrow's daily [18:43:13] (03CR) 10Cmjohnson: [C: 032] Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 (owner: 10Cmjohnson) [18:43:14] ok awesome, thanks cmjohnson1 [18:44:27] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/service/start - 274 bytes in 0.261 second response time [18:44:30] oh, legoktm is taking over the maint script run? why the local version? cc matt_flaschen [18:44:38] (03CR) 10Alex Monk: "Doesn't deleting instances in horizon and creating them in horizon also break wikitech compatibility?" [puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) (owner: 10Andrew Bogott) [18:44:42] huh? [18:44:55] is someone else running it? [18:45:18] legoktm, yes, I said on the task and here. 
[18:45:20] I didn't see anything in SAL.... [18:45:26] matt_flaschen was, then stopped it since he saw another occurence and wanted to check something before restarting [18:45:33] legoktm, also, I cancelled it since Echo isn't working yet. [18:45:41] You're right, I should have logged it as well. [18:45:52] so is any script running? [18:45:59] yours [18:46:07] (03PS3) 10Cmjohnson: Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 [18:46:14] I just killed mine :S [18:46:17] okay, restarting it [18:46:31] legoktm, I think we should wait until the Echo one is fixed. [18:46:39] why? [18:46:46] (03CR) 10jenkins-bot: [V: 04-1] Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 (owner: 10Cmjohnson) [18:46:49] legoktm, since it takes like a day, and we don't want to run it twice. [18:46:53] we can just run it again, but this'll help a bunch of users [18:47:07] Alright [18:48:28] (03PS4) 10Cmjohnson: Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 [18:48:48] !log Started checkLocalUser.php at ~2016-07-12 17:45 UTC, killed ~18:06 since Echo apparently is not fully fixed after all. [18:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:05] (03CR) 10BBlack: Create a new grub module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [18:49:24] matt_flaschen: is there a bug for echo not being fully fixed? [18:49:40] legoktm, no, I'll file. [18:49:42] (03PS5) 10Cmjohnson: Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 [18:49:57] legoktm, I don't know why yet. I'm about to put up a troubleshooting patch. 
[18:50:22] (03CR) 10Cmjohnson: [C: 032] Adding production dns entires for mc10[19-36] [dns] - 10https://gerrit.wikimedia.org/r/298488 (owner: 10Cmjohnson) [18:50:59] legoktm, T140144 [18:50:59] T140144: Echo triggering CentralAuth "Can only obtain a centralauthtoken when using CentralAuth sessions" error - https://phabricator.wikimedia.org/T140144 [18:51:54] * legoktm moves to -collaboration [18:59:19] 07Blocked-on-Operations, 07Puppet, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" - https://phabricator.wikimedia.org/T126699#2454701 (10hashar) a:03JanZerebecki Fixed month ago by @JanZerebecki and... [18:59:24] 07Blocked-on-Operations, 07Puppet, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" - https://phabricator.wikimedia.org/T126699#2454703 (10hashar) 05Open>03Resolved [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160712T1900). [19:00:17] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:01:36] oh strontium [19:01:57] !log git pulled on strontium to sync with palladium [19:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:14] (03PS1) 10Cmjohnson: Removing dns entries for payments1006-8...updating mgmt asset tag [dns] - 10https://gerrit.wikimedia.org/r/298552 [19:02:26] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[19:04:20] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for payments1006-8...updating mgmt asset tag [dns] - 10https://gerrit.wikimedia.org/r/298552 (owner: 10Cmjohnson) [19:04:36] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:04:36] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:36] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [19:09:56] (03PS6) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) [19:10:26] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [19:10:47] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 109, down: 1, dormant: 0, excluded: 1, unused: 0BRge-2/0/2: down - payments3BR [19:11:32] Jeff_Green: ^^^ some payments3BR iface is down [19:11:37] (03PS2) 10Dzahn: ipv6relay: move role to module structure [puppet] - 10https://gerrit.wikimedia.org/r/298412 [19:12:18] (03CR) 10Dzahn: [C: 032] "not used in site.pp anymore" [puppet] - 10https://gerrit.wikimedia.org/r/298412 (owner: 10Dzahn) [19:12:47] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 110, down: 0, dormant: 0, excluded: 1, unused: 0 [19:13:40] ignore that pfw alert.....swapping out payments1003 [19:13:54] i didn't realize we were alerting on port changes, that's great [19:14:42] i didn't either [19:16:02] \o/ [19:25:42] greg-g, about to deploy troubleshooting patch to Echo. [19:28:44] (03CR) 10Hashar: "Why not disable the overcommit entirely with ram_allocation_ratio=1.0 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) (owner: 10Andrew Bogott) [19:28:48] (03PS1) 10Yuvipanda: Add python (aka python3) images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/298557 [19:29:14] (03CR) 10jenkins-bot: [V: 04-1] Add python (aka python3) images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/298557 (owner: 10Yuvipanda) [19:29:35] (03CR) 10Yuvipanda: "I too would go for disabling overcommit totally..." [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) (owner: 10Andrew Bogott) [19:29:37] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:29:49] (03PS2) 10Yuvipanda: Add python (aka python3) images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/298557 [19:30:11] !log mattflaschen@tin Synchronized php-1.28.0-wmf.8/extensions/Echo/includes/ForeignWikiRequest.php: T119736: T140144: Troubleshoot why Echo is still triggering CA failures (duration: 00m 39s) [19:30:12] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [19:30:13] T140144: Echo triggering CentralAuth "Can only obtain a centralauthtoken when using CentralAuth sessions" error - https://phabricator.wikimedia.org/T140144 [19:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:33:15] (03PS1) 10Nuria: Adding how long to wait between aggregated log retention checks [puppet/cdh] - 10https://gerrit.wikimedia.org/r/298558 (https://phabricator.wikimedia.org/T139178) [19:33:40] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:35:22] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [19:37:17] (03PS1) 10BBlack: add lvs_class salt grain [puppet] - 10https://gerrit.wikimedia.org/r/298560 [19:37:51] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:38:09] (03CR) 10BryanDavis: "Production cluster now updated with the latest version of this mapping." [puppet] - 10https://gerrit.wikimedia.org/r/298295 (https://phabricator.wikimedia.org/T136001) (owner: 10BryanDavis) [19:38:25] (03PS2) 10BBlack: add lvs_class salt grain [puppet] - 10https://gerrit.wikimedia.org/r/298560 [19:40:26] (03CR) 10BBlack: [C: 032] add lvs_class salt grain [puppet] - 10https://gerrit.wikimedia.org/r/298560 (owner: 10BBlack) [19:41:50] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [19:47:12] greg-g, anomie, so it is happening even when there is a central ID. Basically, that means I'm not sure how I can tell it's not safe to use CentralAuth. [19:47:39] greg-g, anomie, one of the triggers at least is during autoCreate, but I don't know how I can tell that. [19:47:55] T140144 [19:47:56] T140144: Echo triggering CentralAuth "Can only obtain a centralauthtoken when using CentralAuth sessions" error - https://phabricator.wikimedia.org/T140144 [19:47:59] ottomata: I uploaded another patch :) [19:48:25] (03PS1) 10BBlack: lvs_class grain bugfix [puppet] - 10https://gerrit.wikimedia.org/r/298561 [19:49:52] (03CR) 10BBlack: [C: 032] lvs_class grain bugfix [puppet] - 10https://gerrit.wikimedia.org/r/298561 (owner: 10BBlack) [19:50:28] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2454985 (10CCogdill_WMF) Thanks @faidon and @dpatrick for making this possible, and in the nick of time! I really appreciat... 
[19:50:41] (03PS4) 10Rush: Change ram_allocation_ratio to 1.2 [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) (owner: 10Andrew Bogott) [19:51:02] (03CR) 10Andrew Bogott: "I'm not married to 1.2. But, see the attached bug for data about how very close we've been running to 1.0 already. (And also note one ho" [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) (owner: 10Andrew Bogott) [19:51:09] matt_flaschen: You could check $wgFullyInitialised to detect if you're being called during auto-creation or earlier during the setup process. Determining if ApiCentralAuthToken is safe would basically be duplicating the checks that the module does, see https://phabricator.wikimedia.org/diffusion/ECAU/browse/master/includes/api/ApiCentralAuthToken.php;3782d32bf81a829c3d71a55f4d5ac4c42c820071$40-58. [19:52:24] (03PS1) 10Dzahn: rancid: move role to module structure [puppet] - 10https://gerrit.wikimedia.org/r/298562 [19:52:53] mutante, can you delete the parsoid 0.4.0 deb pkg .. that is stale and is the source of unnecessary help me requests. [19:53:14] we have had the 0.5.1 version for a while and it is known to work. [19:53:50] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/298410 (owner: 10Dzahn) [19:53:58] anomie, okay, will try that. [19:54:07] (03PS2) 10BBlack: Remove old rcstream public LVS config in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/298525 (https://phabricator.wikimedia.org/T134871) [19:54:09] (03PS1) 10BBlack: Remove old rcstream public LVS config in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/298564 (https://phabricator.wikimedia.org/T134871) [19:54:09] subbu: i will look when back from lunch, k [19:54:16] anomie, also, do you know about "Invalid key type: NULL"? See bottom of T140144. 
[19:54:16] T140144: Echo triggering CentralAuth "Can only obtain a centralauthtoken when using CentralAuth sessions" error - https://phabricator.wikimedia.org/T140144 [19:54:27] mutante, wfm. thanks. no rush on it. if you need a ticket, i can create it. [19:54:48] otherwise, post-lunch is fine. [19:55:54] matt_flaschen: I think that one happens after a CAS conflict in updating the global user when CentralAuthUser->getAuthToken() calls CentralAuthUser->resetAuthToken(). [19:56:23] subbu: ok, eh, it's fine either way. [19:56:41] (03PS2) 10BBlack: Remove old rcstream public LVS config in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/298564 (https://phabricator.wikimedia.org/T134871) [19:56:43] (03PS3) 10BBlack: Remove old rcstream public LVS config [puppet] - 10https://gerrit.wikimedia.org/r/298525 (https://phabricator.wikimedia.org/T134871) [19:58:48] anomie, should I file, or do you have it tracked? [20:01:40] (03PS3) 10BBlack: Remove old rcstream public LVS config in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/298564 (https://phabricator.wikimedia.org/T134871) [20:01:42] (03PS4) 10BBlack: Remove old rcstream public LVS config [puppet] - 10https://gerrit.wikimedia.org/r/298525 (https://phabricator.wikimedia.org/T134871) [20:01:44] (03PS1) 10BBlack: remove rcstream lvs::realserver config [puppet] - 10https://gerrit.wikimedia.org/r/298566 (https://phabricator.wikimedia.org/T134871) [20:03:12] matt_flaschen: I don't have it tracked. [20:04:12] (03PS2) 10BBlack: remove default ttl/origin from top of all zonefiles [dns] - 10https://gerrit.wikimedia.org/r/298513 [20:04:24] (03PS1) 10Hashar: contint: APPEND unattended upgrade allowed-origins [puppet] - 10https://gerrit.wikimedia.org/r/298568 (https://phabricator.wikimedia.org/T98885) [20:05:01] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:05:41] (03CR) 10BBlack: [C: 032] remove default ttl/origin from top of all zonefiles [dns] - 10https://gerrit.wikimedia.org/r/298513 (owner: 10BBlack) [20:05:50] (03PS2) 10BBlack: remove needless {{ zonename }} templating [dns] - 10https://gerrit.wikimedia.org/r/298514 [20:05:52] legoktm and/or anomie, can you review https://gerrit.wikimedia.org/r/#/c/298569/ ? [20:06:24] (03CR) 10BBlack: [C: 032] remove needless {{ zonename }} templating [dns] - 10https://gerrit.wikimedia.org/r/298514 (owner: 10BBlack) [20:06:32] (03PS2) 10BBlack: wmnet: explicit full $ORIGIN statements [dns] - 10https://gerrit.wikimedia.org/r/298515 [20:06:34] matt_flaschen: uh, I'll leave that to anomie :S [20:06:53] * anomie looks [20:07:36] Key type null is https://phabricator.wikimedia.org/T140156 . [20:07:39] (03CR) 10BBlack: [C: 032] wmnet: explicit full $ORIGIN statements [dns] - 10https://gerrit.wikimedia.org/r/298515 (owner: 10BBlack) [20:08:52] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [20:09:12] (03PS1) 10Alex Monk: Modify my .gitconfig's core.editor to add nano's --const parameter [puppet] - 10https://gerrit.wikimedia.org/r/298570 [20:09:12] (03CR) 10Hashar: "(random mumbling ignore)" [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) (owner: 10Andrew Bogott) [20:09:19] (03PS2) 10Alex Monk: Modify my .gitconfig's core.editor to add nano's --const parameter [puppet] - 10https://gerrit.wikimedia.org/r/298570 [20:09:33] (03CR) 10Ottomata: [C: 032] Adding how long to wait between aggregated log retention checks [puppet/cdh] - 10https://gerrit.wikimedia.org/r/298558 (https://phabricator.wikimedia.org/T139178) (owner: 10Nuria) [20:10:08] (03CR) 10Hashar: "Cherry picked on CI puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/298568 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [20:10:49] (03CR) 10Hashar: "And unattended-upgrade --dry-run --verbose yields:" [puppet] - 
10https://gerrit.wikimedia.org/r/298568 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [20:11:16] (03CR) 10Andrew Bogott: [C: 032] Modify my .gitconfig's core.editor to add nano's --const parameter [puppet] - 10https://gerrit.wikimedia.org/r/298570 (owner: 10Alex Monk) [20:12:12] ty andrew [20:14:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:14:27] (03CR) 10Rush: [C: 031] "1:1 is probably the hard decision but the right one for now while we figure out what we need to do" [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) (owner: 10Andrew Bogott) [20:14:51] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:16:30] (03PS2) 10Rush: Lower disk overcommmit ratio to 1.5. [puppet] - 10https://gerrit.wikimedia.org/r/298508 (https://phabricator.wikimedia.org/T140122) (owner: 10Andrew Bogott) [20:18:03] (03CR) 10Alex Monk: [C: 04-1] Switch VisualEditor to a negative rather than positive dblist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 (owner: 10Jforrester) [20:18:27] !log Start revision culling script for local_group_wikipedia_T_parsoid_html, from restbase1009.eqiad.wmnet : T140008 [20:18:28] T140008: High RESTBase storage utilization - https://phabricator.wikimedia.org/T140008 [20:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:46] matt_flaschen: anomie status? [20:20:15] matt_flaschen: Reviewed. [20:20:25] just this? 
https://phabricator.wikimedia.org/T119736#2454920 [20:22:22] (03PS2) 10Jforrester: dblists: Switch VisualEditor to a negative rather than positive one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 [20:22:28] (03CR) 10Jforrester: dblists: Switch VisualEditor to a negative rather than positive one (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 (owner: 10Jforrester) [20:23:14] greg-g, no, there is https://gerrit.wikimedia.org/r/#/c/298569/1 under review. [20:24:04] (03PS2) 10Jforrester: dblists: Delete no-longer-used visualeditor-default.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296930 [20:24:08] (03CR) 10Alex Monk: dblists: Switch VisualEditor to a negative rather than positive one (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 (owner: 10Jforrester) [20:25:09] anomie, for some uses of EchoForeignWikiRequest, I don't think direct DB access to the other wiki is feasible (e.g. the ones that return rendered notifications, since each wiki can have different notification-generating extensions installed). Anyway, I'm assuming that is an idea for later. [20:25:46] anomie, so what should I do now, add class_exists CentralAuthSessionProvider and MediaWiki\Session\SessionManager::getGlobalSession()->getProvider() instanceof CentralAuthSessionProvider . [20:25:51] ? [20:26:29] matt_flaschen: Shouldn't be any need for the class_exists bit. [20:27:02] anomie, okay, wasn't sure what instanceof would do if the class didn't exist. Anything else for the current patch? [20:27:27] * anomie tested it with php -r '$x = (object)[]; var_dump($x instanceof FooBar);', and then also php5 and hhvm --php [20:27:50] matt_flaschen: If you add the instanceof check, I think it'll work. [20:30:32] (03CR) 10Alex Monk: "It's pretty close:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 (owner: 10Jforrester) [20:32:17] anomie, updated. 
[20:33:09] (03PS5) 10Andrew Bogott: Change ram_allocation_ratio to 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) [20:33:11] (03PS3) 10Andrew Bogott: Lower disk overcommmit ratio to 1.5. [puppet] - 10https://gerrit.wikimedia.org/r/298508 (https://phabricator.wikimedia.org/T140122) [20:36:06] (03CR) 10Alex Monk: [C: 031] "Once/If we're happy with the parent commit, this looks correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296930 (owner: 10Jforrester) [20:39:59] Lourdes is rotten garbage born in the year 1. She is an old woman with very fat legs who wears little skirts [20:40:10] o.O [20:40:29] Block [20:40:31] please
[20:41:03] (03PS6) 10Andrew Bogott: Change ram_allocation_ratio to 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) [20:41:52] !ops Lourdes is spamming [20:41:56] (that works, right?) [20:45:57] 06Operations, 10Phabricator: Phabricator weekly report not generated (or at least sent) - https://phabricator.wikimedia.org/T139950#2455231 (10greg) Can/Should this (the weekly report script) be run manually now but not enable the crons (there's other things there) until after the failover? [20:47:13] hmm [20:47:13] !ops [20:47:19] greg-g, nope [20:48:02] might be able to try "!ops #wikimedia-operations" in -ops? [20:49:48] <_joe_> uhm [20:49:56] <_joe_> I used to be able to become op here [20:50:04] <_joe_> someone changed the access rules? [20:50:09] (03PS1) 10Mforns: Fix the output directory for multimedia reports [puppet] - 10https://gerrit.wikimedia.org/r/298605 (https://phabricator.wikimedia.org/T140121) [20:50:12] That user has already been killed by freenode staff. So solved ;) [20:50:20] (03PS4) 10Rush: Lower disk overcommmit ratio to 1.5.
[puppet] - 10https://gerrit.wikimedia.org/r/298508 (https://phabricator.wikimedia.org/T140122) (owner: 10Andrew Bogott) [20:50:21] _joe_: wait, I will take a look [20:50:46] (03CR) 10Rush: [C: 031] "I don't have a total bead on best thing to do here but this seems like the right direction" [puppet] - 10https://gerrit.wikimedia.org/r/298508 (https://phabricator.wikimedia.org/T140122) (owner: 10Andrew Bogott) [20:50:54] <_joe_> Luke081515: there are some names to remove too [20:51:00] hm [20:51:07] but you are no longer in that list... [20:51:16] <_joe_> yeah, I have zero idea why [20:51:20] and since verbose is off here, nobody noticed it.. [20:52:21] <_joe_> Luke081515: I'll ask someone with access to do the most important parts - there is some spring cleaning to do [20:52:31] (03CR) 10Nuria: [C: 031] "Looks good but me no merge powers." [puppet] - 10https://gerrit.wikimedia.org/r/298605 (https://phabricator.wikimedia.org/T140121) (owner: 10Mforns) [20:52:37] (03CR) 10MarkTraceur: [C: 031] Fix the output directory for multimedia reports [puppet] - 10https://gerrit.wikimedia.org/r/298605 (https://phabricator.wikimedia.org/T140121) (owner: 10Mforns) [20:52:48] (03CR) 10MarkTraceur: "(ditto)" [puppet] - 10https://gerrit.wikimedia.org/r/298605 (https://phabricator.wikimedia.org/T140121) (owner: 10Mforns) [20:56:26] (03CR) 10Jforrester: "The Wikimania additions are intentional (they're all shut anyway). I'll add legalteamwiki in the next one." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 (owner: 10Jforrester)
[20:57:35] Lourdes is rotten garbage born in the year 1. She is an old woman with very fat legs who wears little skirts [20:57:55] Luke081515: whack a mole [20:58:09] o.O [20:58:29] !ops [20:58:51] greg-g, I just poked people in -ops [20:58:56] thanks [20:58:59] me too, redundantly ;) [20:59:03] I don't have ops here. [20:59:04] :( [20:59:13] I notified staff [20:59:21] c ^ [20:59:44] why don't you shut up [20:59:45] !ops [20:59:50] !ops urgent help [21:00:01] yes, yes [21:00:02] NOBODY SHUTS ME UP, DAMN IT [21:00:12] mniip, ? [21:00:35] ^ Barras, AlexZ [21:00:48] RobH ^^ [21:00:49] matt_flaschen, they are aware of it [21:01:05] spb, LourdesCardenal [21:01:10] satdav, don't they have the right to kick? [21:01:19] matt_flaschen: they don't have ops rights here? [21:01:28] Perhaps if they use the wmfgc account but not with their own accounts [21:01:33] it's almost as chatty icinga-wm [21:01:43] +as [21:01:46] SPF|Cloud, they do have it: -ChanServ- 19 wmfgc +ARefiorstv (MANAGER) [modified 1y 14w 3d ago] [21:02:05] yeah, guessed wmfgc right then.. [21:02:06] so we have a spam-break now... [21:02:08] spb, assuming that was you, thanks :P [21:02:13] it wasn't [21:02:15] it was the bot [21:02:23] Oh, it only kicked in now? [21:02:28] It's been going on for a while [21:02:36] may have spread channel I guess [21:02:36] Sigyn is a bot? [21:02:45] apparently it moved into a channel that the bot was watching [21:02:54] Sigyn [sigyn@freenode/utility-bot/sigyn] [21:02:55] yep [21:02:57] yeah, sigyn is the k-lining one [21:02:59] anyway [21:03:08] spb, can you add the bot in here [21:03:10] we probably need more ops in here. [21:03:11] foks, it started in -ops and got killed probably because of that [21:03:15] foks: heh, I think that was just a kill ;) [21:03:16] not this [21:03:20] * foks nods. [21:03:23] Luke081515, both [21:03:25] both is good [21:03:26] Anyway, greg-g, anomie, RoanKattouw, legoktm, I'm about to deploy https://gerrit.wikimedia.org/r/#/c/298569/ for the Echo/CentralAuth issue. [21:03:33] hm, ok [21:03:46] matt_flaschen: /me nods [21:03:47] anyway, o/ [21:04:16] why can't we get Sigyn in here? [21:04:29] o.O
[21:04:35] Lourdes is rotten garbage born in the year 1. She is an old woman with very fat legs who wears little skirts [21:04:35] mniip: ^ [21:05:27] They really do not know we don't understand them [21:05:32] Luke081515 ^^ [21:05:48] meh [21:05:54] gj Sigyn [21:07:31] I think everyone listed in "/msg chanserv access #wikimedia-operations list" has rights to op themselves. [21:07:50] matt_flaschen: everyone with +o normally [21:07:50] greg-g, maybe you should in case they come back, so you can /kickban them. [21:08:10] I have it? [21:08:19] yes [21:08:26] greg-g, yes, run /msg chanserv op #wikimedia-operations [21:08:26] neat-o [21:08:34] for over 1y now :o [21:08:37] (03PS3) 10Jforrester: dblists: Switch VisualEditor to a negative rather than positive list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 [21:08:38] even better, /csop ;) [21:09:04] (03PS6) 10Chad: Gerrit: install Gerrit on lead (pointing at slave instance for testing) [puppet] - 10https://gerrit.wikimedia.org/r/298118 [21:09:05] Then /kickban UserName if they come back. [21:09:06] I got my /CSKICKBAN ready [21:09:06] ^ grrrit-wm [21:09:08] ^ greg-g [21:10:26] (don't worry, /cskickban auto ops and kicks for me) [21:10:30] Chanserv says I have no permissions to do the fun things [21:10:31] (and ban) [21:10:36] Oh, cool.
[21:10:38] let's see if this user tries to come again ;) [21:10:43] matt_flaschen: irssi alias [21:11:03] (03CR) 10Alex Monk: [C: 031] dblists: Switch VisualEditor to a negative rather than positive list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 (owner: 10Jforrester) [21:11:07] I temporarily grouped the name he used, so if he connects again, he will get disconnected after 30 seconds [21:11:33] no he won't [21:11:43] enforce changes their nick, it doesn't disconnect [21:11:43] (03PS1) 10Eevans: Move node-specific versions to a cluster-wide setting [puppet] - 10https://gerrit.wikimedia.org/r/298631 (https://phabricator.wikimedia.org/T139639) [21:11:55] meh, sry [21:12:07] long work day... sometimes I'm a bit puzzled ;) [21:12:20] if he connects you can try /ns regain though [21:12:21] I mixed it up... [21:12:29] yep, but I'm offline now :-/ [21:12:39] you can try /ns regain, if you want to switch yourself to the spammer's nick [21:13:05] I'm offline for today ;) have a nice evening, hopefully without spam ;) [21:13:59] try not to ride a race condition and get klined by sigyn [21:14:10] ;) [21:14:14] * robh is adding more opsen to the channel access list [21:14:18] it's pretty outdated. [21:14:30] Thanks, robh. [21:14:32] and don't run /msg ChanServ AKICK #wikimedia-operations ADD *!*@* !P or you'll be in a lot of trouble (that's serious, actually, don't run it.) [21:15:05] hi sorry been at work [21:15:10] 06Operations, 10Incident-20151216-Labs-NFS, 06Labs: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#2455332 (10chasemp) We are basically in a holding pattern as we (...well me I guess) tries to get labstore2003/2004 going so we can shift load so...
[21:15:56] 06Operations, 06Labs: Failed drive in labstore2001 array - https://phabricator.wikimedia.org/T139937#2455341 (10chasemp) 05Open>03Resolved ```md0 : active raid1 sdb1[2] sda1[0] 1952839680 blocks super 1.2 [2/2] [UU] bitmap: 1/15 pages [4KB], 65536KB chunk ``` [21:18:02] greg-g: yeah you were once made honorary ops right? [21:18:13] i didnt realize it also gave ops permissions in here, but yay? ;D [21:18:38] (03PS1) 10EBernhardson: vagrant-lxc requires ruby build dependencies [puppet] - 10https://gerrit.wikimedia.org/r/298636 [21:18:48] (03CR) 10Eevans: [C: 031] "Puppet compiler output: http://puppet-compiler.wmflabs.org/3318/" [puppet] - 10https://gerrit.wikimedia.org/r/298631 (https://phabricator.wikimedia.org/T139639) (owner: 10Eevans) [21:18:54] but i just also added the missing half of the ops team from the list. im sure i likely missed someone but im not sure if embedded ops in other teams shoudl be added or not [21:19:05] and im not willing to anger folks by adding them out of turn ;D [21:19:05] robh: :) :) [21:19:32] * greg-g tests updated aliases, ignore [21:19:52] robh, it's like a ceremonial key to the city that is also a master key for City Hall. [21:20:28] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 13Patch-For-Review, and 2 others: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2455380 (10Legoktm) 05Open>03Resolved [21:20:32] if not the entirety of city hall, at least the entryway ;D [21:21:44] (03CR) 10Chad: [C: 031] "Passes compiler, https://puppet-compiler.wmflabs.org/3317/lead.wikimedia.org/. 
Let's give it a shot so we can get most of the service setu" [puppet] - 10https://gerrit.wikimedia.org/r/298118 (owner: 10Chad) [21:22:29] (03CR) 10Paladox: [C: 031] Gerrit: install Gerrit on lead (pointing at slave instance for testing) [puppet] - 10https://gerrit.wikimedia.org/r/298118 (owner: 10Chad) [21:24:13] (03PS7) 10Dzahn: Gerrit: install Gerrit on lead (pointing at slave instance for testing) [puppet] - 10https://gerrit.wikimedia.org/r/298118 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [21:24:21] !log mattflaschen@tin Synchronized php-1.28.0-wmf.8/extensions/Echo/includes/ForeignWikiRequest.php: T140144: Echo/CentralAuth: Bail if not fully initialized (duration: 00m 49s) [21:24:22] T140144: Echo triggering CentralAuth "Can only obtain a centralauthtoken when using CentralAuth sessions" error - https://phabricator.wikimedia.org/T140144 [21:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:29] (03CR) 10Dzahn: [C: 032] Gerrit: install Gerrit on lead (pointing at slave instance for testing) [puppet] - 10https://gerrit.wikimedia.org/r/298118 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [21:36:00] (03CR) 10Dzahn: "recheck wth recheck" [puppet] - 10https://gerrit.wikimedia.org/r/298118 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [21:36:43] (03CR) 10Dzahn: [C: 032] "gate-and-submit ..." [puppet] - 10https://gerrit.wikimedia.org/r/298118 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [21:36:51] paladox: indeed [21:37:06] Yep [21:37:11] your welcome [21:37:57] mutante: I'll handle lead now [21:38:03] ostriches: it just got the IP [21:38:05] ok [21:39:39] Ok, letsencrypt worked, but got failures elsewhere [21:40:07] ostriches :) we get free ssl now [21:40:32] Ahhh, Error: Could not find user gerrit2 [21:40:40] I don't create it as a system user rn. [21:40:44] Since it used to pull from ldap. 
[21:40:47] Oh [21:42:55] greg-g, anomie, no "badsession: Can only obtain a centralauthtoken when using CentralAuth sessions" since 21:24:00 (same time I did the deploy). There are still "Invalid key type: NULL", but I'm not sure Echo can do anything about that (and we're now already catching the exception). [21:43:55] I'm going to take a break to get some lunch. Should be back in < 20, will be available on hangouts. [21:43:59] great [21:44:21] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: Puppet has 1 failures [21:46:15] (03PS1) 10Chad: Gerrit: Create gerrit2 user [puppet] - 10https://gerrit.wikimedia.org/r/298640 [21:46:47] (03CR) 10Paladox: [C: 031] Gerrit: Create gerrit2 user [puppet] - 10https://gerrit.wikimedia.org/r/298640 (owner: 10Chad) [21:56:25] 06Operations, 10RESTBase, 10Traffic, 07HTTPS, 05Security: Enforce HTTPS for authenticated public connections - https://phabricator.wikimedia.org/T88862#2455579 (10BBlack) This can probably be closed now, as all public RB access is via the standard cache clusters which are enforcing HTTPS, but defer to gw... [22:03:32] (03PS7) 10Andrew Bogott: Change ram_allocation_ratio to 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) [22:04:26] (03PS5) 10Andrew Bogott: Lower disk overcommmit ratio to 1.5. [puppet] - 10https://gerrit.wikimedia.org/r/298508 (https://phabricator.wikimedia.org/T140122) [22:04:40] It seems the bot keeps rejoinning [22:04:41] * LourdesCardenal (b959fa06@gateway/web/cgi-irc/kiwiirc.com/ip.185.89.250.6) has left [22:05:30] (03CR) 10Andrew Bogott: [C: 032] Change ram_allocation_ratio to 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/298480 (https://phabricator.wikimedia.org/T140119) (owner: 10Andrew Bogott) [22:06:02] (03CR) 10Andrew Bogott: [C: 032] Lower disk overcommmit ratio to 1.5. 
[puppet] - 10https://gerrit.wikimedia.org/r/298508 (https://phabricator.wikimedia.org/T140122) (owner: 10Andrew Bogott) [22:06:20] PROBLEM - HTTPS on lead is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: SSL connect attempt failed with unknown error error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol [22:06:37] (03PS2) 10Chad: Gerrit: Create gerrit2 user/group [puppet] - 10https://gerrit.wikimedia.org/r/298640 [22:06:38] that's the new gerrit server, not shocking [22:06:50] it's WIP [22:07:21] PROBLEM - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:07:40] ACKNOWLEDGEMENT - HTTPS on lead is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: SSL connect attempt failed with unknown error error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol daniel_zahn setup in progress [22:07:41] ACKNOWLEDGEMENT - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn setup in progress [22:07:41] ACKNOWLEDGEMENT - puppet last run on lead is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn setup in progress [22:07:49] Lourdes es una basura descompuesta nacida el año 1. es una vieja que tiene unas piernas muy gordas y usa falditas 
[22:07:57] Oh my god [22:08:02] Spam [22:08:10] greg-g ^^ [22:08:31] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 13404 MB (3% inode=99%) [22:08:44] We should give some +o access to the regular users of this channel. [22:09:09] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 13Patch-For-Review: Increase time before alert for elasticsearch disk space issues - https://phabricator.wikimedia.org/T136702#2455625 (10debt) moving this to the backlog for now, will pick it up again when we have time. [22:09:49] * ostriches raises hand [22:09:58] Maybe we could create a bot that blocks anyone that says this Lourdes es una basura descompuesta nacida el año 1. in there first sentence [22:10:20] and has a suspicus name as a prevention of wrongful blocks. 
[22:11:12] paladox: they'll change sentences [22:11:25] humans are better than bots to adapt and avoid false positive [22:11:25] Oh [22:11:52] Yep [22:11:53] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 03Discovery-Search-Sprint: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2455650 (10debt) everything is a string now - it's in production. [22:12:23] I'm only comfortable using my rights to give them to other flks on the operations team. anythign more than that should likely have an access request. [22:12:31] Maybe create !admins so that some people can get notified with ease to people reporting spam [22:12:35] but i would +1 the idea =] [22:12:50] spam in here isnt typically a big issue afaik [22:13:10] paladox: please separate potential ping words, like !_admin. You just pinged a lot of people. [22:13:29] Oh sorry i didnt meant to i didnt even think that worked [22:13:40] That's okay, just letting you know for the future. [22:13:45] A lot of us in here are Wikipedia admins [22:13:47] what pings for that? one of the bots? [22:13:57] Wikimedia admins [22:14:01] yes ^ [22:14:05] (or you mean you all have it as pings in your client?) 
[22:14:13] ahh [22:14:35] Oh woops sorry if i knew it would do that i woulden do that (Sorry about that) [22:14:38] Yes, it's common for those with that user right on-wiki to set that ping word on IRC :) [22:14:45] back to your conversation, thanks for your time :) [22:15:13] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, 07HTTPS: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2455672 (10Danny_B) [22:15:22] (03CR) 10Paladox: [C: 031] Gerrit: Create gerrit2 user/group [puppet] - 10https://gerrit.wikimedia.org/r/298640 (owner: 10Chad) [22:16:17] 06Operations, 06Discovery-Search-Backlog: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2455677 (10debt) p:05High>03Normal moving to the backlog board until we have more time to look at this. [22:16:47] if they keep joining from different web gateways, its going to get annoying. [22:18:11] but it was the same each time [22:18:15] so now its on ban list ;] [22:18:22] :) [22:18:47] i'd say they are likely smart enough to figure it out but it also doesnt seem smart to just spam a random irc channel so meh [22:19:02] Yep [22:19:30] (03CR) 10Chad: [C: 031] "Puppet compiled, used correct UID/GIDs. We'll see if it actually works on ytterbium or if it COMPLAINS REALLY LOUD." [puppet] - 10https://gerrit.wikimedia.org/r/298640 (owner: 10Chad) [22:19:56] 06Operations, 10RESTBase, 10Traffic, 07HTTPS, 05Security: Enforce HTTPS for authenticated public connections - https://phabricator.wikimedia.org/T88862#2455697 (10GWicke) 05Open>03Resolved a:03GWicke @bblack, you did indeed resolve this without us doing anything. Many thanks, sir! 
[22:20:53] robh it is using kiwiirc.com too [22:21:11] i hpoe thats not a common use gateway since i just banned it [22:21:21] Oh [22:21:29] https://kiwiirc.com/client [22:21:31] Let me try [22:21:48] if it is i can just unban it in 24h [22:21:56] should be long enough for the troll to go elsewhere [22:21:57] Yep [22:22:18] heh, so it was just him [22:22:21] or her. [22:22:24] Yep [22:22:30] (03CR) 10Dzahn: [C: 032] "yea, we have reserved 444 UID/GID on https://wikitech.wikimedia.org/wiki/UID and don't want to touch the existing setup on ytterbium.. whe" [puppet] - 10https://gerrit.wikimedia.org/r/298640 (owner: 10Chad) [22:22:34] But different ip [22:22:34] then it can just stay in the list, thanks for checking! [22:22:39] oh, yeah. [22:22:40] Your welcome [22:23:21] differnt ip and different uid string in the connection line but meh [22:23:27] it can sit forever then i imagine [22:23:30] Yep [22:29:04] 06Operations, 10Phabricator: Phabricator weekly report not generated (or at least sent) - https://phabricator.wikimedia.org/T139950#2455764 (10Danny_B) @greg That's exactly what I was asking earlier... [22:32:25] (03PS1) 10Chad: Gerrit: configure lucene indexing for new gerrit [puppet] - 10https://gerrit.wikimedia.org/r/298651 [22:33:02] RECOVERY - gerrit process on lead is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:35:17] (03PS1) 10Andrew Bogott: Replace labvirt1010 in the nova scheduling pool. 
[puppet] - 10https://gerrit.wikimedia.org/r/298653 [22:40:23] (03CR) 10Paladox: [C: 031] Gerrit: configure lucene indexing for new gerrit [puppet] - 10https://gerrit.wikimedia.org/r/298651 (owner: 10Chad) [22:52:20] (03PS1) 10Chad: Gerrit: Use proper variable for git directory location on lead [puppet] - 10https://gerrit.wikimedia.org/r/298659 [22:52:22] (03PS1) 10Chad: Gerrit: Properly (not) redirect for acme-challenge [puppet] - 10https://gerrit.wikimedia.org/r/298660 [22:52:42] greg-g, anomie, RoanKattouw, looks good except for T140156 (but even that is now being caught rather than blowing up) [22:52:42] T140156: CentralAuth 'Invalid key type: NULL' - https://phabricator.wikimedia.org/T140156 [22:52:43] mutante: That chain ^ [22:53:00] (03Abandoned) 10Chad: Generate mediawiki-installation dsh group file from hiera data [puppet] - 10https://gerrit.wikimedia.org/r/247324 (https://phabricator.wikimedia.org/T86644) (owner: 10Chad) [22:53:42] ostriches: the first one will make ytterbium gerrit restart, right [22:53:43] matt_flaschen: no new entries in the logs? [22:54:06] mutante: Ah it would. [22:54:10] Lemme live-hack it first. [22:54:11] So it won't [22:54:25] cool [22:54:33] greg-g, not of the cause (badsession exploding in Echo). For the already-messed-up users, we have to wait for the script. legoktm is going to let the current run finish, then re-do it (since the cause was fixed mid-run). [22:54:59] (checkLocalUser.php) [22:55:06] right right [22:55:14] alright, with that, the train is on for tomorrow [22:55:17] ostriches: ^ :) [22:55:40] k [22:55:44] matt_flaschen: everything is in master, right? 
we haven't branched wmf.10 yet, so just making sure they're not all backports [22:56:14] !log ytterbium: disabled puppet for a moment so we can do a config change w/o gerrit restarting itself [22:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:56:29] (03CR) 10Dzahn: [C: 032] Gerrit: configure lucene indexing for new gerrit [puppet] - 10https://gerrit.wikimedia.org/r/298651 (owner: 10Chad) [22:56:30] mutante: Ok should be good [22:56:55] ok, first one is on master [22:57:11] matt_flaschen: is it accurate for me to say that the vast majority of cases have been addressed? [22:57:38] (03CR) 10Dzahn: [C: 032] Gerrit: Use proper variable for git directory location on lead [puppet] - 10https://gerrit.wikimedia.org/r/298659 (owner: 10Chad) [22:57:50] Dammit gerrit. [22:57:58] yup 503 [22:57:59] mutante: Ugh, I ran too soon [22:58:08] It hadn't picked up by the time I ran puppet [22:58:12] greg-g, yes, and even the invalid type thing shouldn't mess up the auto-creation anymore. [22:58:19] awesome [22:58:23] thank you [22:58:26] i noticed when i hit merge on the second [22:58:26] ok [22:58:33] greg-g, but there are probably still messed-up users from before. [22:58:39] Until the script finishes twice. [22:58:56] right [22:59:08] ostriches: ok, second is on master [22:59:33] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2455905 (10RobH) 05Open>03Resolved switch port descriptions removed [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160712T2300). Please do the needful. [23:00:04] matt_flaschen and Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:52] ostriches: done. 
let's see LE again :) [23:01:22] Ok, ytterbium happy with the config backport [23:01:25] It shouldn't restart again [23:01:31] great [23:01:31] Present. Note, I added one more for SWAT. [23:02:06] \o [23:02:10] Hello. I can SWAT this evening. [23:02:19] Lourdes es una basura descompuesta nacida el año 1. es una vieja que tiene unas piernas muy gordas y usa falditas [23:02:27] robh ^^ 
[23:02:57] Thanks [23:02:58] greg is faster than me. [23:03:07] Yep [23:03:10] Could you add +q *!*@gateway/web/cgi-irc/kiwiirc.com/* for the hours to come? [23:03:10] robh: /cskickban user is all I do :) [23:03:10] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures [23:03:50] +q is a quiet line, any person with this gateway could still join but not speak (and so they don't always understand their flood isn't efficient) [23:03:55] ostriches: just checking, you comfortable doing an abbreviated train this week starting tomorrow? [23:04:10] We had positive results on #wikipedia-fr with +q. [23:04:18] greg-g: Yeah [23:04:20] robh: we should probably setup antispammeta in here, then spammers and etc get reported in #wikimedia-ops and the group contacts can deal when no ones around (if they are on the list) [23:04:32] ostriches: cool, emailing the plan out. Thanks. [23:06:13] Dereckson, that is a general-purpose client, though. It's kind of sucky to hell-ban the whole gateway. If good-faith people happen to use it, they won't even know they've been banned. [23:06:20] (Kiwiirc) [23:06:55] im not sure how #wikimedia-ops would resolve stuff in here, its a totally different list of folks. [23:07:38] matt_flaschen: Yes, there is a need to watch a little bit, and avoid to keep the +q once the attack is stopped. 
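(Editorial aside, not part of the channel log.) The trade-off raised about the mask `*!*@gateway/web/cgi-irc/kiwiirc.com/*` is easier to see by checking which `nick!user@host` strings it covers. Below is a minimal sketch, assuming plain `*` and `?` wildcards and simple case-insensitive comparison; real IRC servers apply their own case-mapping rules and extended mask syntax, so this is an approximation. The two gateway hostmasks are the ones shown in this log; `someone!user@example.org` is a hypothetical non-gateway user added for contrast.

```python
import re


def mask_to_regex(mask: str) -> re.Pattern:
    """Translate an IRC-style mask ('*' = any run, '?' = one char) to a regex."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            # Escape everything else so '.', '/', '[' etc. match literally.
            parts.append(re.escape(ch))
    # Simplification: plain case-insensitive match, not IRC casemapping.
    return re.compile("".join(parts), re.IGNORECASE)


def mask_matches(mask: str, hostmask: str) -> bool:
    """Return True if the full nick!user@host string is covered by the mask."""
    return mask_to_regex(mask).fullmatch(hostmask) is not None


QUIET_MASK = "*!*@gateway/web/cgi-irc/kiwiirc.com/*"

hostmasks = [
    # Seen in this log:
    "LourdesCardenal!b959fa06@gateway/web/cgi-irc/kiwiirc.com/ip.185.89.250.6",
    "Guest23666!bca69f56@gateway/web/cgi-irc/kiwiirc.com/ip.188.166.159.86",
    # Hypothetical user on another host, for contrast:
    "someone!user@example.org",
]

for hm in hostmasks:
    print(hm, "->", "quieted" if mask_matches(QUIET_MASK, hm) else "unaffected")
```

Because the mask wildcards both the user part and the trailing IP segment, every client connecting through that gateway is covered, whatever its address. That is why it stops a spammer who rejoins from a new IP, and also why it silences good-faith users of the same web client.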
[23:08:20] PROBLEM - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [23:08:59] greg-g, all of the patches for the Echo thing are in master, wmf8, and wmf9. [23:09:11] mutante: Can you ack that ^ [23:09:27] matt_flaschen: thanks [23:10:02] ACKNOWLEDGEMENT - HTTPS on lead is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: SSL connect attempt failed with unknown error error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol daniel_zahn setup in progress [23:10:02] ACKNOWLEDGEMENT - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn setup in progress [23:10:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1997 [23:10:22] greg-g, ostriches, so you're not going to go back to wmf9, right? [23:10:35] matt_flaschen: https://gerrit.wikimedia.org/r/#/c/298661/1,publish is only for wmf8 so? [23:10:42] matt_flaschen: ostriches I mean, I guess we could do it right now/post swat [23:10:45] matt_flaschen: Don't see why we should bother [23:10:49] but yeah [23:10:53] "meh" [23:11:08] ostriches, I agree, just checking, since if so I need to make sure someone updates the wmf9 submodule and deploys it. [23:11:13] mutante: It'll be a bit before I can run puppet successfully, it needs to finish running the reindex. [23:11:19] But I'm fine dropping wmf9. [23:11:31] Dereckson, yes, only wmf8. [23:11:36] yeah, it had a nice run until Monday :) [23:11:49] ostriches: ok! i was watching progress. it did create the gerrit config. 
getting closer again [23:12:05] !log ytterbium: puppet enabled again, all happy again [23:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:16] !log lead: puppet disabled for a bit while index building is in progress. [23:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:27] mutante: Really, the puppet manfiest *cannot* do the full setup itself. [23:12:31] matt_flaschen: https://gerrit.wikimedia.org/r/#/c/298405/2/wmf-config/InitialiseSettings.php <- could you add the task identifiant at the end of this line? [23:12:36] I'm tempted to rip that bit out. [23:13:00] Well, for a new node. [23:13:00] Dereckson, sure, thanks for reminding me. [23:13:05] reindexing is a pain. [23:13:24] ostriches are you doing the web version of indexing [23:13:36] i thought they said that would be fast. [23:13:52] the web versioning? I have no clue what that means. [23:14:20] Ohhttp://stackoverflow.com/questions/31322148/online-reindexing-in-gerrit-2-11 [23:14:23] ostriches ^^ [23:14:55] grrrit-wm is down? [23:15:10] RECOVERY - check_mysql on lutetium is OK: Uptime: 5450898 Threads: 3 Questions: 77862019 Slow queries: 747191 Opens: 855209 Flush tables: 3 Open tables: 64 Queries per second avg: 14.284 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [23:15:12] https://gerrit-documentation.storage.googleapis.com/Documentation/2.11.2/cmd-index-start.html [23:15:16] root@lead /var/lib/gerrit2/review_site# java -jar bin/gerrit.war index --help [23:15:16] fatal: unknown command index [23:15:16] (no com.google.gerrit.pgm.index) [23:15:31] Oh [23:15:45] Ahhh, I know what happened. [23:15:48] That's changed a bit. [23:15:53] Soooo, schema upgrades possible yes. [23:15:53] https://gerrit-documentation.storage.googleapis.com/Documentation/2.12.3/cmd-index-start.html [23:15:56] But initial indexing is not. 
[23:15:58] yep [23:15:59] jdlrobson: Enable lazy loaded references and images on Thai wikipedia live on mw1017 [23:16:02] Dereckson, updated. [23:16:09] Dereckson: awesome. will take a look [23:16:10] Thanks matt_flaschen. [23:16:12] paladox: So that's useless right now :) [23:16:16] Since it's an initial install :) [23:16:32] Oh, yep since we are going to upgrade the schema for 2.8. [23:16:37] No that's not it. [23:16:42] oh [23:16:51] Dereckson: i'm not seeing it in action. you sure? [23:17:06] paladox: Basically that means "once you have an index you can upgrade it in place" [23:17:12] Oh [23:17:14] We don't have an index yet, since it's a new server. [23:17:21] oh [23:17:23] dereckson@mw1017:~$ md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php [23:17:24] You'd get the same problem installing on your localhost :) [23:17:26] ec0f95bc465f8bacf508eee037ac3881 /srv/mediawiki/wmf-config/InitialiseSettings.php [23:17:30] and on tin: ec0f95bc465f8bacf508eee037ac3881 wmf-config/InitialiseSettings.php [23:17:32] Oh [23:17:35] After init it tells you to run reindex :) [23:17:49] Whatevs, we already at like 60% :) [23:18:01] Oh [23:18:24] It's probaly going through all the repo's i wonder if there will be improvements in searching [23:18:31] since we use lucene now [23:18:54] jdlrobson: `mwrepl thwiki` on mw1017, print_r($wgMFLazyLoadImages); gives me expected values [23:19:10] ([ "base" => 1, "beta" => 1 ]) [23:19:54] Dereckson: mw.config.values.wgMFLazyLoadImages should be true on https://th.m.wikipedia.org/wiki/%E0%B9%80%E0%B8%84%E0%B8%97%E0%B8%B5_%E0%B9%80%E0%B8%9E%E0%B8%A3%E0%B9%8C%E0%B8%A3%E0%B8%B5 but it's not [23:20:07] i've purged page and still same [23:20:32] 5 minutes are sometimes needed for JS extensions like WikiLove [23:21:21] jdlrobson: https://th.m.wikipedia.org/wiki/%E0%B9%80%E0%B8%84%E0%B8%97%E0%B8%B5_%E0%B9%80%E0%B8%9E%E0%B8%A3%E0%B9%8C%E0%B8%A3%E0%B8%B5?debug=true <- i've true [23:21:32] (but false without the ?debug=true) [23:21:40] so, that 
sounds like a cache issue [23:22:03] Dereckson: nope. it's not pure js [23:22:03] true also for non debug version on my side [23:22:17] this should be instant. It has been on other changes. [23:22:40] you still have a false? [23:23:04] There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [23:23:31] mutante: yes, we're in SWAT process if this change is Enable lazy loaded references and images on Thai wikipedia [23:23:48] Dereckson: the change is not live. Can you try resyncing? [23:23:58] You only have true because you are opted into beta I suspect [23:24:03] ok, just wanted to make sure that is not why something doesnt work yet that you expected to work [23:24:30] Dereckson the patch has not been deployed correctly but i'm not sure why [23:24:43] jdlrobson: hey, I only deployed it on mw1017 [23:25:04] jdlrobson: so when I tried it on mw1017, I got a false, then I added ?debug=true to the URL, I got a true, then when I tried again to your URL 10 seconds later, I got a true [23:25:32] ah you mean testwiki? [23:25:37] how will that help me here? [23:25:45] it doesn't impact testwiki - it's thai wiki only [23:25:54] no I mean with https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [23:26:12] Dereckson: i'm not setup to use that. [23:26:18] and it shouldn't be necessary [23:26:21] What browser do you use? [23:26:27] this change is already live on other sites [23:26:32] i just need it turned on for thai wiki [23:27:25] jdlrobson: we've decided to always use mw1017 before sending changes to prod, so you need to set that up [23:27:41] that's rather easy, https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug provides extension for Chrome and Firefox [23:27:52] greg-g: is this really necessary for a simple config variable switch? 
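(Editorial aside, not part of the channel log.) The mw1017 check discussed here amounts to sending an ordinary request with one extra header, so that it is routed to the debug backend instead of the regular pool. Below is a minimal sketch using only the Python standard library. The header name and the `backend=mw1017.eqiad.wmnet` value are the ones quoted in the channel; the helper name `debug_request` and the `Main_Page` URL are hypothetical, and the request is only built here, not sent.

```python
import urllib.request


def debug_request(url: str, backend: str = "mw1017.eqiad.wmnet") -> urllib.request.Request:
    """Build (but do not send) a request routed to the given debug backend."""
    req = urllib.request.Request(url)
    # Header format as quoted in the channel: "X-Wikimedia-Debug: backend=<host>"
    req.add_header("X-Wikimedia-Debug", f"backend={backend}")
    return req


# Hypothetical page URL, for illustration only.
req = debug_request("https://th.m.wikipedia.org/wiki/Main_Page")
for name, value in req.header_items():
    print(f"{name}: {value}")

# To actually fetch the page one would call urllib.request.urlopen(req),
# which requires network access and is therefore left out of this sketch.
```

Note that `urllib` normalizes the stored header name's capitalization; HTTP header names are case-insensitive, so this does not change how the request is routed.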
[23:28:21] In other browsers, you can also manually achieve this with this header: X-Wikimedia-Debug: backend=mw1017.eqiad.wmnet [23:28:40] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [23:29:12] jdlrobson: sometimes you ask to deploy things a little bit more complicated than just a config variable switch by the way, so even if you made an assertion needed / not needed, you would need that [23:29:19] jdlrobson: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki [23:29:31] !log stat1003 still on every puppet run a mongodb gets started..over and over again [23:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:53] mutante: 95% done [23:29:58] ostriches: :) [23:29:59] right, I meant to email out the update on this change [23:30:08] (I ran it with `time` this time too so we can better gauge our downtime for the real upgrade) [23:30:28] nice [23:31:16] jdlrobson: what's your full test procedure for this change by the way? If it's only ensure mw.config.values.wgMFLazyLoadImages is true, that's the case, it's tested. [23:31:41] Dereckson: so yes with those headers i can verify this works as expected [23:31:48] * Dereckson nods [23:31:52] images lazy load [23:32:24] central syslog server is starting to run out of disk ... [23:33:19] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable lazy loaded references and images on Thai wikipedia (T136731) (duration: 00m 38s) [23:33:20] T136731: Deploy lazy loaded images, lazy loaded images + references to a couple larger wikis - https://phabricator.wikimedia.org/T136731 [23:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:25] greg-g, I'm going to write an incident report now. [23:34:00] We have "Days since last incident" now? 
[23:34:14] s/Days/Hours/ [23:34:17] jdlrobson: mw.config.values.wgMFLazyLoadImages at true also on prod :) [23:34:31] Great! thank you Dereckson [23:34:41] matt_flaschen: yeah its some lua magic that timo whipped up [23:34:41] matt_flaschen: thank you very much [23:34:47] sorry i was just not familiar with this new step in SWAT so it threw me [23:34:57] jdlrobson: email being sent now :) [23:35:00] my bad [23:35:36] email sent [23:35:37] 06Operations: lithium (central syslog server) is starting to run low on disk space - https://phabricator.wikimedia.org/T140189#2456097 (10Dzahn) [23:36:34] matt_flaschen: I've CR+2 the gom. Flow change, we wait Zuul. [23:36:43] Merged. [23:37:06] thanks for that email greg-g [23:37:13] matt_flaschen: live on mw1017 [23:40:03] twentyafterfour: so there is an arcanist package for Ubuntu Trusty too? (Trusty is still used by tools labs) [23:40:48] Dereckson, basic regression testing looks good on mw1017 (I can see messages from other wikis on English Wikipedia). [23:41:04] matt_flaschen: and Create Flow boards in any location (flow-create-board) [23:41:15] appears to https://gom.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B6:ListGroupRights [23:41:20] Dereckson: the arcanist package should be platform-agnostic but I'm not sure if it got uploaded to all of the right repos [23:41:21] Dereckson, oh wait [23:41:24] I tested the wrong one. [23:41:33] !log lithium deleted some logs older than 60 days to make space [23:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:43] Dereckson: we just need someone from operations to make it available for all of the distros we use [23:41:50] twentyafterfour: but I don't see it @ apt.wikimedia.org [23:42:05] ok [23:42:45] Dereckson, flow-create-board confirmed on gomwiki 1017, that's the one on mw1017, right? 
[23:42:55] right [23:43:00] RECOVERY - Disk space on lithium is OK: DISK OK [23:43:24] 06Operations: lithium (central syslog server) is starting to run low on disk space - https://phabricator.wikimedia.org/T140189#2456172 (10Dzahn) To prevent it from running full i deleted some files older than 60 days to get some space back on that partition. 16:48 < icinga-wm> RECOVERY - Disk space on lithium... [23:43:55] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add flow-create-board for gomwiki sysop (T139226) (duration: 00m 27s) [23:43:56] T139226: Give Flow board creation rights automatically to admins on the Konkani Wikipedia - https://phabricator.wikimedia.org/T139226 [23:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:30] matt_flaschen: here you are ^ [23:45:40] Dereckson, thanks, confirmed without mw1017. [23:45:49] okay let's go for getCentralAuthToken now [23:47:56] getCentralAuthToken public → protected live on mw1017 [23:49:39] mutante and greg-g > for log access, I need to be added to ldap/nda group [23:50:55] mutante: Long tail on indexing. Last ~4% are the slowest. [23:54:25] Dereckson, CentralAuth typo fix confirmed. [23:54:56] matt_flaschen: could you check if all looks good in the logs? [23:58:09] Dereckson, yeah, grep is running now. [23:58:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:59:07] matt_flaschen: thanks [23:59:46] robh [23:59:47] * Guest23666 (bca69f56@gateway/web/cgi-irc/kiwiirc.com/ip.188.166.159.86) has joined [23:59:54] or greg-g ^^