[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170104T0000). Please do the needful. [00:00:04] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:37] * MaxSem can do it [00:01:14] here [00:01:44] MaxSem: Thanks! [00:02:04] (03PS3) 10MaxSem: Switch nowiki to uca-nb-u-kn collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330264 (owner: 10Kaldari) [00:06:20] (03CR) 10MaxSem: [C: 032] Switch nowiki to uca-nb-u-kn collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330264 (owner: 10Kaldari) [00:06:50] (03PS15) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [00:07:02] (03Merged) 10jenkins-bot: Switch nowiki to uca-nb-u-kn collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330264 (owner: 10Kaldari) [00:08:54] kaldari, pulled on mwdebug1002, please test [00:09:01] looking... [00:10:38] MaxSem: Looks good. Please sync. 
[00:11:09] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2915300 (10fgiunchedi) [00:11:11] 06Operations, 10Traffic, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port gdnsd statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147426#2915297 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi gdnsd metrics deployed [00:12:34] (03CR) 10jenkins-bot: Switch nowiki to uca-nb-u-kn collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330264 (owner: 10Kaldari) [00:13:06] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/330264/3 (duration: 00m 41s) [00:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:21] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2915303 (10fgiunchedi) [00:14:15] kaldari, ^ what should I do to regen? 
[00:14:16] (03CR) 10Dzahn: [C: 032] check_graphite: Fix some IndexError exceptions in Threshold.parse_result [puppet] - 10https://gerrit.wikimedia.org/r/330332 (https://phabricator.wikimedia.org/T154533) (owner: 10Alex Monk) [00:15:44] (03PS2) 10Dzahn: check_graphite: Fix some KeyError exceptions in SeriesThreshold.format_message [puppet] - 10https://gerrit.wikimedia.org/r/330329 (https://phabricator.wikimedia.org/T154533) (owner: 10Alex Monk) [00:16:29] (03CR) 10Dzahn: [C: 032] check_graphite: Fix some KeyError exceptions in SeriesThreshold.format_message [puppet] - 10https://gerrit.wikimedia.org/r/330329 (https://phabricator.wikimedia.org/T154533) (owner: 10Alex Monk) [00:17:01] (03PS2) 10Dzahn: check_graphite: Fix some IndexError exceptions in Threshold.parse_result [puppet] - 10https://gerrit.wikimedia.org/r/330332 (https://phabricator.wikimedia.org/T154533) (owner: 10Alex Monk) [00:20:28] (03PS16) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [00:22:58] anybody to look into merging https://gerrit.wikimedia.org/r/#/c/327907/ ? [00:23:54] SMalyshev: Add it to SWAT? :) [00:24:02] did so [00:25:07] Reedy: here https://wikitech.wikimedia.org/wiki/Deployments#Wednesday.2C.C2.A0January.C2.A004 [00:25:25] Hmm. Looks like MaxSem was swatting [00:26:25] MaxSem: ping? [00:26:37] pong [00:26:58] MaxSem: https://gerrit.wikimedia.org/r/#/c/327907/ for SWAT? 
[00:27:10] (03PS5) 10MaxSem: Add new units for the following: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327907 (https://phabricator.wikimedia.org/T150881) (owner: 10Smalyshev) [00:27:27] (03CR) 10MaxSem: [C: 032] Add new units for the following: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327907 (https://phabricator.wikimedia.org/T150881) (owner: 10Smalyshev) [00:28:07] (03Merged) 10jenkins-bot: Add new units for the following: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327907 (https://phabricator.wikimedia.org/T150881) (owner: 10Smalyshev) [00:28:17] (03CR) 10jenkins-bot: Add new units for the following: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327907 (https://phabricator.wikimedia.org/T150881) (owner: 10Smalyshev) [00:29:14] cool, thanks! [00:29:52] SMalyshev, pulled to mwdebug1002 [00:30:41] (03PS2) 10Andrew Bogott: Add mirantis backports repo for Openstack classes on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/330319 [00:31:38] MaxSem: excellent, works! [00:31:55] (03CR) 10Andrew Bogott: [C: 032] Add mirantis backports repo for Openstack classes on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/330319 (owner: 10Andrew Bogott) [00:33:53] (03PS17) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [00:33:54] !log maxsem@tin Synchronized wmf-config/unitConversionConfig.json: https://gerrit.wikimedia.org/r/#/c/327907/5 (duration: 00m 40s) [00:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:06] SMalyshev, ^ please test [00:34:41] MaxSem: yup, still works, thank you! 
[00:34:51] :) [00:35:16] (03CR) 10Andrew Bogott: [C: 032] Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [00:40:56] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2915355 (10fgiunchedi) [00:40:59] 06Operations, 05Prometheus-metrics-monitoring, 15User-Elukey: Port memcached statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147326#2915352 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi @elukey ok! we can keep the metrics for now, if it turns out to be a problem we can blac... [00:41:12] MaxSem: Can I add the fix for https://phabricator.wikimedia.org/T154548 (UBN) once it merges & cherry-picks? [00:41:34] sure [00:42:01] !log removed 2fa for account per T154171 [00:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:24] Thanks [00:46:52] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2915359 (10fgiunchedi) >>! In T152791#2894847, @ArielGlenn wrote: >>>! In T152791#2894804, @fgiunchedi wrote: >> @ArielGlenn indeed the stacked graphs are meant fo... [00:52:29] PROBLEM - Tool Labs instance distribution on labtestcontrol2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:52:39] PROBLEM - keystone http on labtestcontrol2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:00:29] (03PS1) 10Andrew Bogott: Revert "Keystone: Move api service to uwsgi/nginx" [puppet] - 10https://gerrit.wikimedia.org/r/330341 [01:01:01] PROBLEM - keystone-admin on labtestcontrol2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:01:30] PROBLEM - keystone-public on labtestcontrol2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:01:59] (03CR) 10Andrew Bogott: [C: 032] Revert "Keystone: Move api service to uwsgi/nginx" [puppet] - 10https://gerrit.wikimedia.org/r/330341 (owner: 10Andrew Bogott) [01:03:25] !log maxsem@tin Synchronized php-1.29.0-wmf.7/extensions/Flow: https://gerrit.wikimedia.org/r/#/c/330338/ (duration: 00m 58s) [01:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:41] http://XN--80ADGFMAN1AA4L.xn--p1ai. [01:04:16] http://XN--B1AAJAMACM1DKMB.xn--p1ai [01:04:47] MaxSem: ^ [01:05:11] ? [01:05:30] remember those IDN domains? [01:05:38] it's Cyrillic [01:05:38] uh [01:05:51] wanna kill them? [01:05:53] there were like 8 of them [01:05:59] 6 are not registered anymore [01:06:06] 2 are left and pointing to us [01:06:19] i just wanna know if i kill 6 or all 8 basically :p [01:06:31] ask legal? [01:06:31] i wouldnt have any idea who the contact was [01:06:34] was this Ukraine? [01:06:37] they dont know [01:06:41] it's 3 people ago [01:06:42] or something [01:06:45] no, these are russian domains [01:07:01] ok [01:07:21] we'd kind of prefer to not have them [01:07:28] unless WMRU would be mad :p [01:07:33] and they never worked afaict? [01:08:42] yea, so the 2 remaining ones are org: Wikimedia Foundation, Inc. 
[01:09:32] i wonder if they are just random [01:09:43] or in any way "better" than the other 6 at https://gerrit.wikimedia.org/r/#/c/328604/ [01:11:40] PROBLEM - DPKG on labtestcontrol2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [01:12:31] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:12:50] RECOVERY - keystone-admin on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 783 bytes in 0.086 second response time [01:13:20] RECOVERY - keystone-public on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.080 second response time [01:13:31] RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.083 second response time [01:13:48] it's random, "wikiversity" is there and wikipedia but that is also the only one that points to parking [01:14:40] RECOVERY - DPKG on labtestcontrol2001 is OK: All packages OK [01:14:52] вікімедіа is wikimedia NOT wikipedia ? [01:15:04] then the wikipedia.org is missing in the first place.. oh well [01:15:20] RECOVERY - Tool Labs instance distribution on labtestcontrol2001 is OK: OK: All critical toollabs instances are spread out enough [01:15:30] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [01:16:10] mutante: вікімедіа is wikimedia and Ukrainian [01:16:48] ah, so i did remember there was something Ukrainian. thanks! 
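Side note on the `xn--` hostnames posted at 01:03: these are the ASCII (punycode) forms of internationalized domain names, and `xn--p1ai` is the ASCII form of the Cyrillic TLD `.рф`. A minimal sketch for converting between the two forms, assuming Python 3 and its built-in `idna` codec (IDNA 2003); the helper names `to_unicode` and `to_ascii` are made up for illustration and do not come from the log or from any Wikimedia tooling:

```python
# Sketch only: helper names are hypothetical, chosen for this example.

def to_unicode(hostname: str) -> str:
    """Decode an ASCII (punycode) hostname into its Unicode form."""
    # Lowercase first: the hostnames in the log are partly upper-case,
    # and DNS names are case-insensitive anyway.
    return hostname.lower().encode("ascii").decode("idna")


def to_ascii(hostname: str) -> str:
    """Encode a Unicode hostname into its ASCII (punycode) form."""
    return hostname.encode("idna").decode("ascii")


# The TLD discussed in the log, in both directions.
print(to_unicode("xn--p1ai"))
print(to_ascii("рф"))
```

Passing one of the full hostnames from 01:03 (without the trailing dot) to `to_unicode` should show which Cyrillic name it stands for, which is a quick way to match a DNS zone entry against the project it was registered for.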
[01:17:13] well, but none of them ever really worked afaict [01:17:26] .xn--p1ai are russian ones though [01:17:40] i see [01:17:45] at least the top domain is Russian [01:18:07] (I have no idea if they allow registering Ukrainian names there and whether anybody does that :) [01:18:30] PROBLEM - keystone http on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 5000: Connection refused [01:18:50] PROBLEM - keystone-admin on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 35357: Connection refused [01:19:20] PROBLEM - keystone-public on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 5000: Connection refused [01:20:20] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:22:24] SMalyshev: there were like 9 of those IDNs, 8 Russian 1 Ukrainian. 6 of them are expired and not registered, of the 3 existing ones 1 is a blackhole and the other 2 get you an error from our Apaches :p time to remove all of that [01:22:46] never heard about people using it but .. what do i know [01:22:50] RECOVERY - keystone-admin on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 783 bytes in 0.081 second response time [01:23:20] RECOVERY - keystone-public on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.080 second response time [01:23:30] RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.079 second response time [01:23:38] years ago was the request to add them [01:24:58] !log maxsem@tin Synchronized php-1.29.0-wmf.6/extensions/CentralAuth: https://gerrit.wikimedia.org/r/#/c/330345/ (duration: 00m 44s) [01:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:09] i even did that but it was before it was even in git [01:26:20] found a ticket where Wikimedia Ukraine asked .. 
hrmmm [01:30:11] (03PS2) 10Dzahn: remove IDNs that are not registered by us anymore [dns] - 10https://gerrit.wikimedia.org/r/328604 (https://phabricator.wikimedia.org/T137105) [01:32:14] (03PS3) 10Dzahn: remove IDNs that are not registered by us anymore [dns] - 10https://gerrit.wikimedia.org/r/328604 (https://phabricator.wikimedia.org/T137105) [01:32:38] (03PS4) 10Dzahn: remove Russian IDNs that are not registered by us anymore [dns] - 10https://gerrit.wikimedia.org/r/328604 (https://phabricator.wikimedia.org/T137105) [01:33:37] (03CR) 10Dzahn: [C: 032] remove Russian IDNs that are not registered by us anymore [dns] - 10https://gerrit.wikimedia.org/r/328604 (https://phabricator.wikimedia.org/T137105) (owner: 10Dzahn) [01:34:39] (03PS1) 10Filippo Giunchedi: prometheus: temporary rsync server for metrics migration [puppet] - 10https://gerrit.wikimedia.org/r/330348 (https://phabricator.wikimedia.org/T148408) [01:35:43] (03CR) 10jerkins-bot: [V: 04-1] prometheus: temporary rsync server for metrics migration [puppet] - 10https://gerrit.wikimedia.org/r/330348 (https://phabricator.wikimedia.org/T148408) (owner: 10Filippo Giunchedi) [01:38:58] (03PS2) 10Filippo Giunchedi: prometheus: temporary rsync server for metrics migration [puppet] - 10https://gerrit.wikimedia.org/r/330348 (https://phabricator.wikimedia.org/T148408) [01:39:26] (03PS1) 10Dzahn: park викиданные.рф (wikidata) & викиверситет.рф (wikiversity) [dns] - 10https://gerrit.wikimedia.org/r/330349 (https://phabricator.wikimedia.org/T137105) [01:41:13] there, Cyrillic in commit message and the bot and IRC client don't even break it [01:42:20] (03PS2) 10Dzahn: park викиданные.рф (wikidata) & викиверситет.рф (wikiversity) [dns] - 10https://gerrit.wikimedia.org/r/330349 (https://phabricator.wikimedia.org/T137105) [01:44:42] (03PS1) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330350 (https://phabricator.wikimedia.org/T150774) [01:46:25] (03CR) 10Dzahn: [C: 032] 
park викиданные.рф (wikidata) & викиверситет.рф (wikiversity) [dns] - 10https://gerrit.wikimedia.org/r/330349 (https://phabricator.wikimedia.org/T137105) (owner: 10Dzahn) [01:49:22] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [02:04:16] 06Operations, 10Analytics: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#2915651 (10fgiunchedi) [02:17:26] (03PS1) 10Dereckson: Add throttle rules for January 2017 events in Maharashtra [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330356 (https://phabricator.wikimedia.org/T154312) [02:17:30] mutante: are you still there? Can we deploy something in emergency? A throttle rule for an event in India today ^ [02:18:14] (and with timezone joys, it's 7h47 there, if their workshop is at 8am, we need the rule right now) [02:34:38] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 11m 17s) [02:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:42] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 4.124 second response time [02:42:42] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.628 second response time [02:44:31] Hi! Happy new year, all! :) Anyone know offhand where the code that generates the x_analytics header lives? thx!!!!! ;D [02:47:42] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.539 second response time [02:53:28] AndyRussG: https://github.com/wikimedia/mediawiki-extensions-XAnalytics/blob/master/XAnalytics.class.php [02:54:22] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. 
Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [02:54:45] AndyRussG: and places like https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/4c91f4b6f839735f768f81f922a0381cd1c1ead6/WikimediaEventsHooks.php [02:55:15] AndyRussG: see also https://wikitech.wikimedia.org/wiki/X-Analytics [02:57:26] bd808: oooh that's very clever! cool thx :) [02:57:44] (03PS2) 10Dereckson: Add throttle rules for January 2017 events in Maharashtra [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330356 (https://phabricator.wikimedia.org/T154312) [02:57:46] I was gonna go looking in Varnish code [02:57:50] Hey bd808, hello :) We would need https://gerrit.wikimedia.org/r/330356 merged right now (throttle rule for an event starting this morning). Could you help? ^ [02:59:29] Oooh I see there's even a link to the extension at the bottom of the doc page, sorry I missed that 8p [03:00:16] Dereckson: you just want a +1 or you want me to do the deploy? [03:00:56] If you can deploy it, that would be great. [03:02:24] Dereckson: Reedy and I are going to work on the extension to move all of this into the db and expose it as a special page next week. These patches are really a pain for the event organizers [03:02:56] Good news. 
[03:03:24] (03CR) 10BryanDavis: [C: 032] Add throttle rules for January 2017 events in Maharashtra [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330356 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [03:03:59] (03Merged) 10jenkins-bot: Add throttle rules for January 2017 events in Maharashtra [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330356 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [03:04:36] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.7) (duration: 13m 05s) [03:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:29] bd808: note *all* the rules we added are multi wiki, so centralizing that on meta would be a *very good* idea [03:05:57] (probably because of the tradition to add Commons to every rule) [03:05:58] !log bd808@tin Synchronized wmf-config/throttle.php: Add throttle rules for January 2017 events in Maharashtra (T154312) (duration: 00m 42s) [03:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:02] T154312: Request for a temporary lift of account creation cap on IPs (2017-01-04,2017-01-06,2017-01-10) - https://phabricator.wikimedia.org/T154312 [03:06:38] (but Indian events for example are typically local language + English, in Canada they have fr/en workshops, etc.) [03:07:11] Dereckson: yeah. that's one of the things we need to work out. the current extension is single wiki. 
[03:07:40] T27000 is the epic for the work [03:07:41] T27000: Review and deploy ThrottleOverride extension to Wikimedia wikis - https://phabricator.wikimedia.org/T27000 [03:10:09] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jan 4 03:10:08 UTC 2017 (duration 5m 33s) [03:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:26] (03CR) 10jenkins-bot: Add throttle rules for January 2017 events in Maharashtra [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330356 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [03:13:39] Dereckson: doesn't look like anything melted :) [03:17:05] Thanks for the deploy. [03:20:44] yw [03:22:22] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [03:26:58] (03CR) 10Andrew Bogott: [C: 032] Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330350 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [03:29:02] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.32 seconds [03:30:02] PROBLEM - keystone http on labcontrol1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 78 bytes in 0.001 second response time [03:30:02] PROBLEM - keystone process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/keystone-all [03:31:02] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 284.43 seconds [03:32:03] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [03:42:44] (03PS1) 10Andrew Bogott: Revert "Keystone: Move api service to uwsgi/nginx" [puppet] - 10https://gerrit.wikimedia.org/r/330364 [03:44:32] PROBLEM - keystone process on labtestcontrol2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/keystone-all [03:44:33] PROBLEM - keystone http on labtestcontrol2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 78 bytes in 0.073 second response time [03:52:02] RECOVERY - keystone http on labcontrol1001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.004 second response time [03:52:02] RECOVERY - keystone process on labcontrol1001 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/keystone-all [04:00:37] PROBLEM - keystone-admin on labtestcontrol2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 78 bytes in 0.072 second response time [04:01:08] PROBLEM - keystone-public on labtestcontrol2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 78 bytes in 0.072 second response time [04:02:07] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [04:04:17] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [04:05:17] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1806.507057 Seconds [04:06:17] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 35.641591 Seconds [05:11:38] (03CR) 10Andrew Bogott: [C: 032] Revert "Keystone: Move api service to uwsgi/nginx" [puppet] - 10https://gerrit.wikimedia.org/r/330364 (owner: 10Andrew Bogott) [05:13:08] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 20 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[uwsgi],Service[keystone] [05:14:27] PROBLEM - keystone-admin on labcontrol1001 is CRITICAL: connect to address 208.80.154.92 and port 35357: Connection refused [05:14:37] PROBLEM - DPKG on labcontrol1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [05:14:57] PROBLEM - keystone http on labcontrol1001 is CRITICAL: connect to address 208.80.154.92 and port 5000: Connection refused [05:14:57] PROBLEM - keystone-public on labcontrol1001 is CRITICAL: connect to address 208.80.154.92 and port 5000: Connection refused [05:15:37] RECOVERY - DPKG on labcontrol1001 is OK: All packages OK [05:19:09] (03PS1) 10Papaul: DNS: Add mgmt and production DNS entries for elastic2025-elastic2036 Bug:T154251 [dns] - 10https://gerrit.wikimedia.org/r/330369 [05:19:51] (03PS1) 10Andrew Bogott: uwsgi: uwsgi should run as root, not as www-data [puppet] - 10https://gerrit.wikimedia.org/r/330370 (https://phabricator.wikimedia.org/T150774) [05:20:17] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:21:38] (03PS1) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330371 (https://phabricator.wikimedia.org/T150774) [05:25:27] RECOVERY - keystone-admin on labcontrol1001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 783 bytes in 0.006 second response time [05:25:37] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [05:29:43] PROBLEM - keystone process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/keystone-all [05:31:43] RECOVERY - keystone process on labcontrol1001 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/keystone-all [05:32:03] RECOVERY - keystone http on labcontrol1001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.004 second response time [05:32:13] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [05:35:03] PROBLEM - keystone http on labcontrol1001 is CRITICAL: connect to address 208.80.154.92 and port 5000: Connection refused [05:35:43] PROBLEM - keystone process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/keystone-all [05:45:43] RECOVERY - keystone process on labcontrol1001 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/keystone-all [05:46:03] RECOVERY - keystone http on labcontrol1001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.004 second response time [05:46:23] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [05:49:23] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [05:54:33] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:01:13] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:08:46] 06Operations, 10MediaWiki-Internationalization: Norwegian messages inContentLanguage look for on-wiki overrides at the /nb subpage, not the 
root page - https://phabricator.wikimedia.org/T126146#2915847 (10TTO) Scheduled for the next SWAT at 1400 UTC today. [06:15:39] (03PS4) 10TTO: Set valid content language for Norwegian wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: 10Nikerabbit) [06:16:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [06:17:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:20:47] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2915849 (10Papaul) [06:23:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:24:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:29:23] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:31:43] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:42:13] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:47:13] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:48:43] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [06:49:43] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [06:57:43] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[screen],Package[jq] [07:01:23] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [07:05:03] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:17:43] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:24:15] !log Compressing more tables on db1044 - T153826 [07:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:20] T153826: Defragment db1044 - https://phabricator.wikimedia.org/T153826 [07:26:43] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [07:31:23] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:34:04] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:34:13] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [07:43:56] !log Compressing tables on db1015 - T153739 [07:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:00] T153739: Defragment db1015 - https://phabricator.wikimedia.org/T153739 [07:49:44] (03PS1) 10Marostegui: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330376 (https://phabricator.wikimedia.org/T154031) [07:59:23] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:31:44] (03CR) 10Muehlenhoff: Initial debianization (039 comments) [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [08:40:24] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [08:42:14] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [08:48:50] <_joe_> zhuyifei1999_, Revent: the queue of transcodes is now reducing fast [08:49:08] <_joe_> but we have to find a way to throttle video2commons or this will happen again [08:52:22] _joe_: it might not, considering that apache bug was fixed [08:52:46] but suggestions welcome [08:56:59] I still believe having a ton of 1080p videos starting at the same time triggered the bug [09:02:30] <_joe_> well after we fixed the bugs, we were still accruing backlog [09:02:40] <_joe_> and with 4 servers instead of the traditional two [09:02:55] <_joe_> as I might have said, I think we need a lower-prio queue [09:07:33] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:11:34] (03CR) 10Muehlenhoff: [C: 031] "Seems fine. (The package is installed on Ubuntu systems via the ubuntu-standard meta package)." [puppet] - 10https://gerrit.wikimedia.org/r/328952 (owner: 10Dzahn) [09:19:09] (03CR) 10Muehlenhoff: [C: 031] "Seems fine, shall I build/upload or is there more to come?" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/330255 (https://phabricator.wikimedia.org/T154205) (owner: 10Chad) [09:25:46] _joe_: I'd suggest to make 720p, 1080p, and future VP9 transcodes put into a "slow" low-priority queue [09:26:40] low-res gets transcoded first [09:27:27] and honestly, 360p & 480p videos are good enough for usual watching, if not watching the details [09:29:23] <_joe_> zhuyifei1999_: I'll try to hook up with the relevant people at the WMDS [09:30:38] brion: your input welcome :) [09:31:01] <_joe_> zhuyifei1999_: yeah with "people" I meant "brion and a couple others" [09:31:04] <_joe_> :P [09:31:21] ok [09:32:17] what's WMDS btw? 
I don't think I ever heard of that [09:32:39] <_joe_> https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit [09:32:45] <_joe_> sorry, acronyms :( [09:32:47] oh [09:32:49] <_joe_> I should know better [09:33:33] I know that summit, but never heard of that acronym :( thanks for explaining [09:34:03] <_joe_> zhuyifei1999_: it's just me being too concise, I hate using acronyms others might or might not get [09:34:27] <_joe_> but I have to prepare my slides for the event, so I'm in a hurry :P [09:34:31] oh can I restart v2c after the queue gets < 100? [09:34:33] k [09:34:37] <_joe_> yes [09:34:52] ok thx [09:34:53] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916063 (10jcrespo) > we will not be able to use jdbc for this as it is not supported, so we will need to set it server side. Again, this is not a... [09:34:53] <_joe_> but keep in mind next week almost no one will be around during EU times [09:35:37] my timezone is broken anyways [09:41:37] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2644059 (10Joe) I might be very unpopular with this opinion, but: I think emojis in code reviews should be avoided. Plain old written text is the w... [09:43:12] (03PS2) 10Muehlenhoff: role::mediawiki::jobrunner: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/320549 [09:47:01] (03CR) 10Muehlenhoff: [C: 032] role::mediawiki::jobrunner: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/320549 (owner: 10Muehlenhoff) [09:56:10] (03CR) 10Muehlenhoff: "If it minimises disruption, let's make the switch piece by piece, then? How about the week after WMDS/allhands?" 
[puppet] - 10https://gerrit.wikimedia.org/r/316341 (owner: 10Muehlenhoff) [09:59:48] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2785082 (10Gehel) Since hadoop does not seem to ever actually use logstash-gelf, all releva... [10:01:33] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2916089 (10Gehel) logstash-gelf has been upgraded on disk, but a cluster restart is still needed to pick up that... [10:02:10] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2916090 (10Gehel) [10:05:58] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Update logstash on wikimedia to 2.x or 5.x - https://phabricator.wikimedia.org/T154473#2916096 (10Gehel) For completeness, the upgrade to elasticsearch 5.x is tracked on T154501 [10:08:22] (03PS7) 10Muehlenhoff: Make systemd-timesyncd available as an alternative time synchronisation provider [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150257) [10:10:05] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330376 (https://phabricator.wikimedia.org/T154031) (owner: 10Marostegui) [10:10:38] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330376 (https://phabricator.wikimedia.org/T154031) (owner: 10Marostegui) [10:10:49] (03CR) 10jenkins-bot: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330376 (https://phabricator.wikimedia.org/T154031) (owner: 
10Marostegui) [10:12:18] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2060 - T154031 (duration: 00m 47s) [10:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:23] T154031: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031 [10:18:13] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:20:41] (03CR) 10Jcrespo: [C: 031] Add ferm service for mariadb_dbproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316341 (owner: 10Muehlenhoff) [10:42:57] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2916144 (10Marostegui) @Cmjohnson for this ticket, let's go one by one. Change one, we will let the raid rebuild and then change the other one. Thanks! [10:47:13] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:50:17] (03PS2) 10Muehlenhoff: Add ferm service for mariadb_dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/316341 [10:50:37] (03CR) 10jerkins-bot: [V: 04-1] Add ferm service for mariadb_dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/316341 (owner: 10Muehlenhoff) [10:54:45] (03PS3) 10Muehlenhoff: Add ferm service for mariadb_dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/316341 [11:03:31] (03PS2) 10Nemo bis: Restore $wgMFEEditorOptions['anonymousEditing'] = true for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [11:06:29] (03CR) 10Nemo bis: [C: 031] "Fixed typo in commit message." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [11:07:00] (03PS1) 10Urbanecm: Throttle rules for 2017-01-06/07, tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330388 (https://phabricator.wikimedia.org/T1545688) [11:07:46] (03PS2) 10Urbanecm: Throttle rules for 2017-01-06/07, tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330388 (https://phabricator.wikimedia.org/T1545688) [11:08:49] (03PS3) 10Urbanecm: Throttle rules for 2017-01-06/07, tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330388 (https://phabricator.wikimedia.org/T154568) [11:10:05] (03CR) 10Alexandros Kosiaris: "I 'll give it to you that the current UX is subpar." [puppet] - 10https://gerrit.wikimedia.org/r/328673 (https://phabricator.wikimedia.org/T153167) (owner: 10Gilles) [11:10:15] (03CR) 10Muehlenhoff: [C: 032] Make systemd-timesyncd available as an alternative time synchronisation provider [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [11:10:58] (03CR) 10Revi: "I knew there's a typo. Actually my intention was to say 'Revert the false' commit." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [11:11:48] I can't commit inside the bus full of human [11:11:50] (03PS1) 10Hashar: build: bump bundler puppet version 3.4 -> 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/330390 (https://phabricator.wikimedia.org/T143233) [11:11:51] :-p [11:14:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This by definition makes running uwsgi currently way less secure. 
Making the user configurable would be preferable (and defaulting to www-" [puppet] - 10https://gerrit.wikimedia.org/r/330370 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [11:18:41] !log continuing maintenance on db1035 (mysql replication stopped) [11:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:45] (03CR) 10Alexandros Kosiaris: [C: 032] build: bump bundler puppet version 3.4 -> 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/330390 (https://phabricator.wikimedia.org/T143233) (owner: 10Hashar) [11:20:49] (03PS2) 10Alexandros Kosiaris: build: bump bundler puppet version 3.4 -> 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/330390 (https://phabricator.wikimedia.org/T143233) (owner: 10Hashar) [11:20:53] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] build: bump bundler puppet version 3.4 -> 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/330390 (https://phabricator.wikimedia.org/T143233) (owner: 10Hashar) [11:20:59] akosiaris: danke :) [11:21:22] hashar: de rien [11:21:40] I went mad yesterday and wanted to polish up a new documentation generation system for puppet.git [11:21:48] and eventually all bits connected at like 11pm :/ [11:26:25] (03PS1) 10Urbanecm: [throttle] Add rules for 2017-01-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) [11:28:53] (03PS1) 10Hashar: build: bump bundler rainbow dependency [puppet] - 10https://gerrit.wikimedia.org/r/330393 [11:33:02] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "what alex said; also, why does keystone need to run with the keystone user?" 
[puppet] - 10https://gerrit.wikimedia.org/r/330370 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [11:35:08] (03CR) 10Alexandros Kosiaris: [C: 032] build: bump bundler rainbow dependency [puppet] - 10https://gerrit.wikimedia.org/r/330393 (owner: 10Hashar) [11:49:04] (03CR) 10Dereckson: [C: 04-1] [throttle] Add rules for 2017-01-06 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) (owner: 10Urbanecm) [11:55:23] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 2 minutes ago with 5 failures. Failed resources (up to 3 shown): Exec[zotero-admin_ensure_members],Exec[sc-admins_ensure_members],Exec[wikidev_ensure_members],Exec[ops_ensure_members] [12:16:18] (03PS1) 10Muehlenhoff: Enable systemd-timesyncd on multatuli [puppet] - 10https://gerrit.wikimedia.org/r/330400 [12:22:13] (03CR) 10Muehlenhoff: [C: 032] Enable systemd-timesyncd on multatuli [puppet] - 10https://gerrit.wikimedia.org/r/330400 (owner: 10Muehlenhoff) [12:22:23] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:27:25] (03PS1) 10Urbanecm: Add pawiki's HD logo and fix two typos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330401 (https://phabricator.wikimedia.org/T150618) [12:33:21] (03PS3) 10Revi: Revert '$wgMFEEditorOptions['anonymousEditing'] = false for kowiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) [12:36:33] PROBLEM - Check systemd state on multatuli is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:37:26] (03PS4) 10Revi: Revert '$wgMFEEditorOptions['anonymousEditing'] = false for kowiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) [12:38:20] (03PS1) 10Muehlenhoff: Purge ntp package when using systemd-timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330402 [12:40:57] (03PS6) 10Hashar: Puppet doc with strings/yard [puppet] - 10https://gerrit.wikimedia.org/r/309561 (https://phabricator.wikimedia.org/T143233) [12:45:16] (03CR) 10Alexandros Kosiaris: [C: 031] Purge ntp package when using systemd-timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330402 (owner: 10Muehlenhoff) [12:57:04] (03PS2) 10Gehel: osm: install prerequisite packages for meddo [puppet] - 10https://gerrit.wikimedia.org/r/328176 (https://phabricator.wikimedia.org/T153289) [12:59:23] PROBLEM - NTP on multatuli is CRITICAL: NTP CRITICAL: No response from NTP server [13:03:03] ^looking [13:05:07] (03PS1) 10Muehlenhoff: Switch swift in esams to systemd-timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330404 (https://phabricator.wikimedia.org/T150257) [13:12:45] 06Operations, 10netops: cr2-esams<->cr2-eqiad link flaps - https://phabricator.wikimedia.org/T154577#2916308 (10faidon) [13:17:18] 06Operations, 10netops: cr2-esams<->cr2-eqiad link flaps - https://phabricator.wikimedia.org/T154577#2916362 (10faidon) This has been raised to Level3 as ticket #12023671. 
[13:19:31] (03CR) 10Gilles: "I've just looked at one of the first performance alerts Peter and Timo started defining:" [puppet] - 10https://gerrit.wikimedia.org/r/328673 (https://phabricator.wikimedia.org/T153167) (owner: 10Gilles) [13:29:10] jouncebot: next [13:29:10] In 0 hour(s) and 30 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170104T1400) [13:29:38] hashar: half full eu swat today [13:29:55] easy :D [13:32:17] (03PS3) 10Gehel: osm: install prerequisite packages for meddo [puppet] - 10https://gerrit.wikimedia.org/r/328176 (https://phabricator.wikimedia.org/T153289) [13:37:52] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916378 (10Aklapper) (Same feelings here as Joe: Task could be declined.) [13:40:08] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916379 (10Paladox) But utf8 is not full unicode. It's better to just fix it then to leave it. Yes we shouldn't bug dba but this is a db thing. [13:42:48] (03Abandoned) 10Hashar: contint: migrate slaves to /srv [puppet] - 10https://gerrit.wikimedia.org/r/311959 (owner: 10Hashar) [13:46:45] (03PS4) 10Hashar: contint: move from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) [13:46:56] 06Operations, 10Traffic, 13Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2916386 (10ema) Because of the Log overrun issue we are actually losing quite a lot of information. I'm now comparing the values produced by the [[ https://gerrit... 
[13:49:24] (03PS4) 10Ema: varnishreqstats: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328688 (https://phabricator.wikimedia.org/T151643) [13:51:12] (03CR) 10Muehlenhoff: [C: 032] Purge ntp package when using systemd-timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330402 (owner: 10Muehlenhoff) [13:53:18] zeljkof: need to grab a coffee etc [13:54:51] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2916409 (10Aklapper) >>! In T154205#2912984, @Paladox wrote: > Apparently the way they fixed it was the wrong... [13:59:03] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2916416 (10Paladox) I did where I said per Luca at https://gerrit-review.googlesource.com/#/c/93479/ [13:59:30] jouncebot: next [13:59:31] In 0 hour(s) and 0 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170104T1400) [14:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170104T1400). Please do the needful. [14:00:05] revi, tto, and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:09] huh [14:00:17] Ah, almost forgot [14:00:21] Present [14:00:27] o/ [14:00:55] hashar: what's the plan for eu swat today? want to take it? should I take it? 
[14:01:22] 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2916419 (10akosiaris) Unfortunately, as discussed multiple times in the past, reprepro does not really allow us to easily have more than one version of a pa... [14:02:38] anybody? :P (My late dinner is ready) [14:02:55] revi, tto, Urbanecm: can you test your commits at mwdebug1002, once they are there? [14:02:59] (03PS1) 10Muehlenhoff: Don't apply NTP Icinga check to standard::ntp::timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330411 [14:03:04] Yes, indeed [14:03:09] revi can go first :) [14:03:32] revi: preparing for swat and waiting for hashar to say if he wants to take over :) [14:03:38] I'm doing SWAT after about a year of absence so I dunno how to access mwdebug1002 [14:03:57] revi: let me see, there are docs somewhere [14:04:44] revi: here it is https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [14:05:08] got it [14:05:09] thanks [14:06:18] o/ [14:06:38] hashar: want to take swat today? should I do it? #together? [14:06:42] revi: we have browser extensions for firefox/chromium which let one easily point to the debug servers [14:06:48] zeljkof: go for it :} [14:06:53] downloaded [14:06:56] the two throttle changes can be pushed together [14:07:06] (configuring it to allow secret tab) [14:08:12] hashar: the question is, who does the swat? :) [14:08:26] zeljkof, no, throttle rules are untestable [14:08:33] (untestable in any way) [14:08:44] Urbanecm: yes, I remember now [14:08:59] I only replied to your question. [14:09:11] is my commit on mwdebug1002? [14:09:17] hashar, may I know what is secret tab?
;) [14:09:19] (03PS1) 10Hashar: Migrate puppet compiler instance from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/330412 (https://phabricator.wikimedia.org/T146381) [14:09:21] it seems I can edit on mobile without logged in, hmm [14:09:54] Urbanecm: yes, I remembered that throttle changes are untestable when you said it [14:09:59] revi: not yet [14:10:05] zeljkof, ok [14:10:26] ok, no reply from hashar, I guess I'll do the swat today [14:10:27] k, think there was a bug that allows my commit useless lol [14:10:33] Urbanecm: secret tab? [14:10:39] hashar, yes [14:11:02] incognito mode I meant [14:12:23] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [14:12:47] starting with revi's patch ^ [14:12:54] (03Merged) 10jenkins-bot: Revert '$wgMFEEditorOptions['anonymousEditing'] = false for kowiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [14:13:01] \o/ [14:13:04] (03CR) 10jenkins-bot: Revert '$wgMFEEditorOptions['anonymousEditing'] = false for kowiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [14:15:19] revi: should I push your commit to mwdebug1002? can you test it there? 
[14:15:28] I think so [14:15:46] I can't test it with my phone but I can test with my firefox [14:16:15] revi: ok, it's at mwdebug1002, please test and let me know if I can push to cluster [14:16:58] works fine [14:17:01] (03PS5) 10Zfilipin: Set valid content language for Norwegian wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: 10Nikerabbit) [14:17:14] revi: ok, pushing to the cluster [14:20:44] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:329446|Revert $wgMFEEditorOptions[anonymousEditing] = false for kowiki (T119823)]] (duration: 00m 41s) [14:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:49] T119823: Set $wgMFEditorOptions['anonymousEditing'] = true for kowiki - https://phabricator.wikimedia.org/T119823 [14:21:34] revi: pushed to production, please test [14:21:59] tto: your 277519 is next [14:22:08] I'm here [14:22:24] done zeljkof [14:22:29] worksfine (tm) [14:22:30] hashar: had some strange scap warning :| https://phabricator.wikimedia.org/P4703 [14:22:34] revi: great! 
[14:22:45] zeljkof: yeah filled a task about it [14:22:48] working as intended ™ [14:22:50] in short [14:22:55] some plugin fails to import the "sh" modules [14:22:57] but that is ignored [14:23:01] I mean [14:23:06] the plugin failing to load is ignored [14:23:06] hashar: huh [14:23:06] good [14:23:15] Chad has a patch for it [14:24:08] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: 10Nikerabbit) [14:24:38] (03Merged) 10jenkins-bot: Set valid content language for Norwegian wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: 10Nikerabbit) [14:24:49] (03CR) 10jenkins-bot: Set valid content language for Norwegian wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: 10Nikerabbit) [14:24:57] tto: can you test the patch at mwdebug1002, once it is ther? [14:24:59] there? [14:25:03] YEs [14:25:37] tto: great, will ping you in a minute [14:27:25] tto: it's there, please test and let me know if I can push to production [14:27:43] Will test [14:28:27] (03PS4) 10Zfilipin: Throttle rules for 2017-01-06/07, tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330388 (https://phabricator.wikimedia.org/T154568) (owner: 10Urbanecm) [14:30:56] zeljkof, all is well [14:31:12] tto: ok, pushing to production [14:31:33] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:32:45] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:277519|Set valid content language for Norwegian wikis (T126146)]] (duration: 00m 41s) [14:32:46] (03PS5) 10Ema: varnishreqstats: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328688 (https://phabricator.wikimedia.org/T151643) [14:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:49] T126146: Norwegian messages inContentLanguage look for on-wiki overrides at the /nb subpage, not the root page - https://phabricator.wikimedia.org/T126146 [14:32:55] tto: pushed, please test [14:33:11] looking [14:33:36] Urbanecm: deploying your commits [14:34:32] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330388 (https://phabricator.wikimedia.org/T154568) (owner: 10Urbanecm) [14:34:44] zeljkof, I'm here :) [14:35:14] Urbanecm: well, there's nothing you can do, right? :) [14:35:24] is there a way to test the changes? [14:35:44] zeljkof, they are throttles so not [14:36:07] Urbanecm: well, in that case, will deploy and pray ;) [14:36:14] zeljkof: Works [14:36:19] Thanks! [14:36:22] tto: yeah! :) [14:37:32] ok [14:38:55] Urbanecm: zuul is busy with wikibase... https://integration.wikimedia.org/zuul/ [14:42:01] well, jenkins is busy, but still the commit is in the queue... [14:47:01] Okay. [14:47:09] bah [14:47:09] Am I required to do something? 
[14:47:12] it lacks instances [14:47:18] Urbanecm: no [14:47:20] Okay [14:47:21] you are free [14:47:37] I will merge and deploy your commits, as soon as jenkins allows it [14:51:02] (03Merged) 10jenkins-bot: Throttle rules for 2017-01-06/07, tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330388 (https://phabricator.wikimedia.org/T154568) (owner: 10Urbanecm) [14:51:13] (03CR) 10jenkins-bot: Throttle rules for 2017-01-06/07, tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330388 (https://phabricator.wikimedia.org/T154568) (owner: 10Urbanecm) [14:51:28] (03PS6) 10Ema: varnishreqstats: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328688 (https://phabricator.wikimedia.org/T151643) [14:51:38] (03PS2) 10Zfilipin: [throttle] Add rules for 2017-01-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) (owner: 10Urbanecm) [14:54:00] hashar, Urbanecm: I have just noticed that https://gerrit.wikimedia.org/r/#/c/330392/ has one -1 from Dereckson [14:54:28] (03PS1) 10Hashar: (DO NOT SUBMIT) rake doc on CI [puppet] - 10https://gerrit.wikimedia.org/r/330418 [14:54:32] zeljkof: and ? [14:54:35] zeljkof: skip the patch! [14:54:45] No. Fix the patch :) [14:54:52] ok, will deploy the first one then [14:54:57] I'll fix it. 
[14:55:09] Urbanecm: ok, please fix it, while I deploy the first one [14:56:29] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:330388|Throttle rules for 2017-01-06/07, tewiki (T154568)]] (duration: 00m 40s) [14:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:33] T154568: Raise throttling cap on user registration, image upload on commons.wikimedia.org and te.wikipedia.org on 2017-01-06 to 2017-01-07 - https://phabricator.wikimedia.org/T154568 [14:56:59] Fied [14:57:02] *Fixed [14:57:19] (03PS3) 10Urbanecm: [throttle] Add rules for 2017-01-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) [14:57:21] (03CR) 10Urbanecm: "Done." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) (owner: 10Urbanecm) [14:57:25] Urbanecm: ok, deploying [14:57:51] thx [14:59:13] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) (owner: 10Urbanecm) [15:00:33] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:03:28] !log extending eu swat until 330392 is merged [15:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:20] (03CR) 10jerkins-bot: [V: 04-1] [throttle] Add rules for 2017-01-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) (owner: 10Urbanecm) [15:07:27] Urbanecm: unit tests failed :| https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/11432/console [15:07:37] Invalid parameter in a throttle rule detected: ip [15:11:03] Urbanecm, hashar: 330392 has failed jobs and we are out of time, let's leave it for another swat window, ok? 
[15:11:13] na [15:11:14] just do it [15:11:26] let me check [15:11:49] oh [15:11:49] 15:06:12 Invalid parameter in a throttle rule detected: ip [15:11:50] 15:06:12 Failed asserting that an array contains 'ip'. [15:11:50] 15:06:12 [15:13:41] But 117.211.27.115 is an ip, isn't it? [15:13:51] zeljkof, should I schedule it for tomorrow? [15:13:54] (03PS4) 10Hashar: [throttle] Add rules for 2017-01-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) (owner: 10Urbanecm) [15:13:58] Urbanecm: IP [15:14:00] hashar: I am really not familiar with throttling, I am hesitant to make the change myself [15:14:14] hashar, it should be capital? [15:14:20] Urbanecm: later today, or tomorrow is fine with me [15:14:31] we have a throttling change almost every week. There is a bunch of inline documentation in the comments :D [15:14:35] Urbanecm: ok, hashar made the change [15:14:38] and tests to cover the throttle definition [15:14:57] Say my thank you to the tests. [15:15:04] They didn't exist before some time... [15:15:05] (03CR) 10Hashar: [C: 032] [throttle] Add rules for 2017-01-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) (owner: 10Urbanecm) [15:15:25] I'll read the docs more precisely next time. [15:15:27] I promise. [15:15:30] :) [15:15:38] :D [15:16:18] (03Merged) 10jenkins-bot: [throttle] Add rules for 2017-01-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) (owner: 10Urbanecm) [15:17:04] Urbanecm: well there are tests [15:17:07] so who cares about the doc :} [15:17:12] try something, if that fails read the doc !
[15:17:21] patches on mwdebug1002 [15:17:21] (03PS20) 10Paladox: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) [15:17:27] Urbanecm, hashar: I am deploying the change [15:17:30] (03CR) 10Anomie: [C: 031] "Seems sane. Haven't tested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330328 (https://phabricator.wikimedia.org/T154112) (owner: 10Gergő Tisza) [15:17:32] hashar, it isn't testable, is it? [15:17:41] I usually browse en.wikipedia.org [15:17:45] and hit special random page [15:17:48] just in case :} [15:18:11] (03CR) 10jenkins-bot: [throttle] Add rules for 2017-01-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330392 (https://phabricator.wikimedia.org/T154312) (owner: 10Urbanecm) [15:18:27] looks good [15:18:41] hashar: did you deploy it? [15:19:01] !log hashar@tin Synchronized wmf-config/throttle.php: (no message) (duration: 00m 41s) [15:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:17] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/330418 (owner: 10Hashar) [15:20:19] !log EU SWAT finished [15:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:33] \O/ [15:22:27] (03CR) 10Hashar: "Gave it a try by making 'rake test' to point to the :doc task on https://gerrit.wikimedia.org/r/#/c/330418/ and yard managed to run/gen" [puppet] - 10https://gerrit.wikimedia.org/r/309561 (https://phabricator.wikimedia.org/T143233) (owner: 10Hashar) [15:26:11] (03Draft1) 10Paladox: Gerrit: Enable port 4560 so it can connect to prod logstash [puppet] - 10https://gerrit.wikimedia.org/r/330420 (https://phabricator.wikimedia.org/T141324) [15:26:14] (03Draft2) 10Paladox: Gerrit: Enable port 4560 so it can connect to prod logstash [puppet] - 10https://gerrit.wikimedia.org/r/330420 (https://phabricator.wikimedia.org/T141324) [15:27:51] (03Abandoned) 10Paladox: Gerrit: Enable 
port 4560 so it can connect to prod logstash [puppet] - 10https://gerrit.wikimedia.org/r/330420 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [15:32:29] (03CR) 10Chad: [C: 032] gerrit (2.13.4-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - 10https://gerrit.wikimedia.org/r/330255 (https://phabricator.wikimedia.org/T154205) (owner: 10Chad) [15:33:13] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:33:15] (03CR) 10Chad: "(whoops, wrong button)" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/330255 (https://phabricator.wikimedia.org/T154205) (owner: 10Chad) [15:40:50] anyone up to land a rake/CI change on puppet.git ? Would like to switch puppet doc generation from broken 'puppet rdoc' to a new modern system (puppet-strings / yard) [15:41:04] change https://gerrit.wikimedia.org/r/#/c/309561/ , noop for prod and the related CI job seems to pass all fine [15:41:13] and hopefully we will get some nice doc generated [15:43:06] (03PS21) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [15:43:13] (03PS22) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [15:43:47] (03Abandoned) 10Hashar: (DO NOT SUBMIT) rake doc on CI [puppet] - 10https://gerrit.wikimedia.org/r/330418 (owner: 10Hashar) [15:48:44] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [15:51:40] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2916581 (10Marostegui) Chris has replaced 32:0 disk (which is part of the SPAN #0) It is rebuilding now: ``` root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -aALL Rebuild Progress on Device at 
Enc... [15:54:16] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916586 (10demon) What @paladox said. Yes, this was filed in context of emojis, but really it's about all extended unicode that mysql/maria's "utf8"... [15:55:33] (03PS23) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [15:59:18] (03PS1) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) [16:00:05] who called operations/production branch this https://github.com/wikimedia/operations-puppet/tree/fuckportblocks lol [16:02:13] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:05:05] (03PS24) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [16:05:08] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1006 - https://phabricator.wikimedia.org/T154418#2916620 (10Cmjohnson) @fgiunchedi The disk has been replaced and has been added back cmjohnson@ms-be1001:~$ sudo megacli -CfgForeign -Scan -a0 There are 1 foreign configuration(s) on controll... [16:06:54] (03PS25) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [16:08:23] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [16:10:13] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [16:10:55] (03CR) 10Gehel: [C: 031] "We did a few iterations with paladox. This now looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:11:00] (03CR) 10jerkins-bot: [V: 04-1] Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [16:11:35] volans: ^ Yeah! A public repo full of magic! [16:11:56] (03CR) 10Paladox: "> We did a few iterations with paladox. This now looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:11:59] gehel: :-P [16:12:00] (03CR) 10Paladox: [C: 031] Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:12:55] (03CR) 10Chad: "Please don't merge just yet, I want to amend this a tad (not a major change, but want it a little more useful when the logstash host is un" [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:13:23] RECOVERY - MegaRAID on db1033 is OK: OK: optimal, 1 logical, 2 physical [16:13:33] RECOVERY - MegaRAID on ms-be1006 is OK: OK: optimal, 13 logical, 13 physical [16:15:03] PROBLEM - Disk space on ms-be1001 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error [16:15:22] gehel: volans: cumin is almost happy with tests :} Mind bothering you to land a puppet.git change that change the way doc is generated ? 
:} [16:15:40] https://gerrit.wikimedia.org/r/#/c/309561/ switch puppet doc generation from broken 'puppet rdoc' to a new modern system (puppet-strings / yard) and gives us NICER doc :} [16:16:38] hashar: I can take a look later, after some meeting [16:16:49] ;D [16:16:55] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1001 - https://phabricator.wikimedia.org/T154396#2916675 (10Cmjohnson) @fgiunchedi Disk at slot 2 was added back cmjohnson@ms-be1001:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Online,... [16:19:14] (03CR) 10Gehel: [C: 031] "@chad: sure, not merging this." [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:19:49] (03PS2) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) [16:19:57] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1033 - https://phabricator.wikimedia.org/T152214#2916705 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Disk 7 was replaced: cmjohnson@db1033:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Online,... [16:21:01] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2916710 (10Marostegui) Span #0 is now rebuilt. ``` root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -aALL Device(Encl-32 Slot-0) is not in rebuild process Device Present... [16:21:53] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1033 - https://phabricator.wikimedia.org/T152214#2916715 (10Marostegui) Thanks!! [16:22:22] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916718 (10Paladox) Oh, is that mysql binary or is it utf8 binary? 
[16:23:48] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2916721 (10Marostegui) 32:2 has been replaced and it is getting rebuilt ``` root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 2% in... [16:24:22] hashar: I did not know about puppet-strings yet (always good to learn something). Doesn't it require a specific documentation format inline? (I have not tried, just read the docs) [16:25:23] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 23 failures. Last run 2 minutes ago with 23 failures. Failed resources (up to 3 shown): Exec[chown /srv/deployment/zotero for deploy-service],Package[zsh-beta],Package[coreutils],Package[quickstack] [16:27:53] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:22] (03CR) 10Ema: [C: 032] varnishreqstats: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328688 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [16:28:36] (03CR) 10Hashar: "After I have made that change, I eventually subscribed to Debian rust-maintainers list at https://lists.alioth.debian.org/mailman/listinfo" [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [16:31:54] (03PS10) 10Eevans: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [16:32:11] (03CR) 10Eevans: RESTBase-Cassandra: Add the topk reporter (0315 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [16:34:09] gehel: yeah by default it uses markdown, but I tweaked it to use the "rdoc" markup format [16:34:17] gehel: eventually we will want to switch :D [16:40:15] gehel: volans 
don't worry will follow tomorrow. Gotta rush out [16:40:26] hashar: have fun! [16:40:33] ok cya [16:40:48] you can still give it a try and look at the generated doc. It is rather nice :} [16:41:44] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916768 (10Dzahn) >>! In T145885#2916065, @Joe wrote: > I think emojis in code reviews should be avoided. > I honestly don't see why this should ev... [16:42:41] (03CR) 10Eevans: RESTBase-Cassandra: Add the topk reporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [16:43:39] (03CR) 10Chad: "Disabling is my preferred route (and what I'm working on an amended patch for). In that case, it'll go back to writing them on-disk like i" [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:43:43] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:47:57] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916773 (10jcrespo) > this is a db thing To prove this is **not** a DB server issue, but a client code/driver bug: ``` $ mysql -h m2-master review... [16:51:29] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916775 (10demon) >>! In T145885#2916773, @jcrespo wrote: >> this is a db thing > > To prove this is **not** a DB server issue, but a client code/d... 
[16:52:23] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:53:01] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2916799 (10ArielGlenn) That looks great and covers my use cases. Thanks! [16:55:22] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916800 (10demon) >>! In T145885#2915101, @Paladox wrote: > Apparently jdbc does not support utf8mb4 > > https://www.google.co.uk/#q=fatal:+++cause... [16:55:53] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:55:55] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916801 (10jcrespo) > altering existing tables/data Altering tables and is not a problem, changing global configuration that affects other database... 
[16:58:28] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2916808 (10Jdforrester-WMF) [16:58:33] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10MediaWiki-extensions-ContentTranslation, and 5 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2916807 (10Jdforrester-WMF) [16:58:50] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2216638 (10Jdforrester-WMF) [16:59:33] RECOVERY - MegaRAID on db1048 is OK: OK: optimal, 1 logical, 2 physical [17:02:03] RECOVERY - Disk space on ms-be1001 is OK: DISK OK [17:03:05] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Update logstash on wikimedia to 2.x or 5.x - https://phabricator.wikimedia.org/T154473#2916820 (10Deskana) [17:03:16] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 2.x or 5.x - https://phabricator.wikimedia.org/T154473#2912333 (10Deskana) [17:03:24] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#2912333 (10Deskana) [17:03:58] 07Puppet, 10MediaWiki-Vagrant: mediawiki/vagrant puppet classes "3d" are illegal with puppet - https://phabricator.wikimedia.org/T154594#2916826 (10hashar) [17:04:13] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#2912333 (10Deskana) I will assume we'd be going with 5.x to keep Logstash at the same version as other things in our ELK stack. 
[17:04:34] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#2916840 (10Deskana) p:05Triage>03Normal [17:04:54] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916841 (10Paladox) >>! In T145885#2916800, @demon wrote: >>>! In T145885#2915101, @Paladox wrote: >> Apparently jdbc does not support utf8mb4 >> >... [17:05:56] (03CR) 10Chad: [C: 031] "Hmm, well we already disable sending as of the current page. I'd kinda like to disable the on-disk logs when we *are* sending, but I guess" [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:06:09] (03CR) 10Chad: [C: 031] "s/page/patch" [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:07:41] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916844 (10demon) >>! In T145885#2916841, @Paladox wrote: > oh so doing characterEncoding=utf8mb4 works? Because when i tried that it failed to sta... [17:08:32] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916846 (10Paladox) Oh, continuing to use utf-8 works but pasting an emoji still fails with 500 error. [17:10:48] (03CR) 10Gehel: [C: 031] "@Chad: yeah, my preference is also to keep on disk logs. When things break, it's nice to have them..." 
[puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:11:15] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2916852 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good now - thanks Chris!! ``` root@db1048:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL Device(Encl-32 Slot-2) is not in rebuild proces... [17:14:34] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2916861 (10Cmjohnson) [17:15:24] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2818166 (10Cmjohnson) we do not have any spare disks at this time. We will need to order a few spare. [17:18:08] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2916878 (10Dzahn) @paladox What was the solution that resulted in it _not_ showing a 500 error but just a ? instead. Afair there was one? That would... 
[17:19:39] (03PS1) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [17:20:43] (03CR) 10jerkins-bot: [V: 04-1] Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [17:22:43] (03PS1) 10BBlack: unified cert: infra for per-dc switching [puppet] - 10https://gerrit.wikimedia.org/r/330437 [17:22:45] (03PS1) 10BBlack: unified cert: use digicert in esams [puppet] - 10https://gerrit.wikimedia.org/r/330438 [17:23:28] (03CR) 10jerkins-bot: [V: 04-1] unified cert: infra for per-dc switching [puppet] - 10https://gerrit.wikimedia.org/r/330437 (owner: 10BBlack) [17:24:10] bleh :P [17:25:28] (03PS2) 10BBlack: unified cert: infra for per-dc switching [puppet] - 10https://gerrit.wikimedia.org/r/330437 [17:25:29] (03PS2) 10BBlack: unified cert: use digicert in esams [puppet] - 10https://gerrit.wikimedia.org/r/330438 [17:25:34] (03PS2) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [17:28:57] (03CR) 10Volans: "inline comment for discussion" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [17:34:46] (03PS5) 10BBlack: TLS: reduce scope of stream.wm.o redirect exception [puppet] - 10https://gerrit.wikimedia.org/r/328193 (https://phabricator.wikimedia.org/T143925) [17:34:48] (03PS23) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [17:34:50] (03PS23) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [17:34:52] (03PS24) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 
10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [17:34:54] (03PS8) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 (https://phabricator.wikimedia.org/T143925) [17:47:03] !log Ran manual DB update to officewiki for T153320. [17:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:43] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Degraded raid on barium - https://phabricator.wikimedia.org/T154039#2916946 (10Jgreen) I ~think~ both of the original 3TB disks have failed. I was able to confirm that we replaced the correct disk from the before+after megacli reports, but the hardware RAID... [17:49:43] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Degraded raid on barium - https://phabricator.wikimedia.org/T154039#2916960 (10Jgreen) p:05Triage>03Normal changing status to normal, the task is now "remove both 3TB disks from barium" [17:51:23] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:53:24] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#1936600 (10faidon) Early January is here and the 15th is coming up fast ­­-- @yuvipanda rightfully mentioned above that this will need a (presumably a... [17:55:47] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2916980 (10yuvipanda) @jcrespo How does Jan 25 / 26 work for you? [18:03:48] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2917046 (10jcrespo) >>! In T123731#2916980, @yuvipanda wrote: > @jcrespo How does Jan 25 / 26 work for you? +1. let's meet at some point to organize... 
[18:05:33] RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 725 bytes in 0.778 second response time [18:06:13] RECOVERY - keystone-public on labtestcontrol2001 is OK: HTTP OK: HTTP/1.0 300 Multiple Choices - 735 bytes in 0.074 second response time [18:17:33] (03PS1) 10Jdrewniak: Bumping portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330450 (https://phabricator.wikimedia.org/T128546) [18:25:54] (03CR) 10Filippo Giunchedi: [C: 032] gerrit (2.13.4-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - 10https://gerrit.wikimedia.org/r/330255 (https://phabricator.wikimedia.org/T154205) (owner: 10Chad) [18:26:33] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[exim4-daemon-light],Package[exim4-config] [18:27:03] PROBLEM - Disk space on krypton is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied [18:27:03] PROBLEM - Disk space on ununpentium is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied [18:29:03] RECOVERY - Disk space on ununpentium is OK: DISK OK [18:30:21] ^ those were caused by exim4 upgrades by me [18:30:31] ok, I was scratching my head [18:30:36] on how that could have happened [18:30:37] !log rolling out exim4 upgrades (DSA 3747-1) on misc servers [18:30:38] made no sense [18:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:45] sorry, i had the log line already open [18:30:54] oh, absolutely no issue [18:31:03] I just was like wat? [18:31:31] i did it on the "misc-ops" debdeploy group of servers, which is why it's those few random ones... 
to make sure first [18:34:03] (03CR) 10Filippo Giunchedi: "Built and uploaded to carbon" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/330255 (https://phabricator.wikimedia.org/T154205) (owner: 10Chad) [18:36:05] (03CR) 10Chad: "Thanks!" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/330255 (https://phabricator.wikimedia.org/T154205) (owner: 10Chad) [18:39:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/1/3: down - Core: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave] [18:40:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/1/3: down - Core: cr2-eqiad:xe-4/1/3 (Level3, BDFS2448, 84ms) {#A0010621} [10Gbps wave] [18:42:01] urgh [18:42:09] hopefully level3 investigating [18:42:10] bblack: ^ [18:44:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 [18:44:43] (03PS2) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330371 (https://phabricator.wikimedia.org/T150774) [18:44:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [18:45:40] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330371 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [18:48:37] (03PS3) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330371 (https://phabricator.wikimedia.org/T150774) [18:48:43] PROBLEM - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 4, Working: 4, Failed: 2, Spare: 0 [18:48:45] ACKNOWLEDGEMENT - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 4, Working: 4, 
Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T154603 [18:48:48] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2917240 (10ops-monitoring-bot) [18:52:55] paravoid: I temporarily lost my IRC bouncer connectivity around the same time [18:52:59] might be a broad issue [18:54:33] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:57:04] (03PS4) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330371 (https://phabricator.wikimedia.org/T150774) [18:58:00] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330371 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [18:58:57] (03PS5) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330371 (https://phabricator.wikimedia.org/T150774) [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170104T1900). [19:00:04] jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:01:05] I can SWAT jan_drewniak ping when you're around. 
[19:01:09] o/ [19:01:15] okie doke :) [19:02:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330450 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:03:40] (03Merged) 10jenkins-bot: Bumping portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330450 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:03:53] (03CR) 10jenkins-bot: Bumping portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330450 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:05:19] jan_drewniak: live on mwdebug1002 if you can check there [19:06:06] thcipriani: yup, looks good [19:06:17] jan_drewniak: ok, running sync-portals [19:08:24] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:330450|Bumping portal to master]] T128546 (duration: 00m 43s) [19:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:28] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [19:08:35] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2917289 (10Paladox) @Dzahn if we do just &connectionCollation=utf8mb4_unicode_ci that should at least fix the 500 errors but emojis won't show. [19:09:08] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:330450|Bumping portal to master]] T128546 (duration: 00m 42s) [19:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:14] ^ jan_drewniak should be live [19:10:52] thcipriani: yup, thanks! [19:10:58] yw! 
[19:12:59] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2917240 (10fgiunchedi) We've seen this in the past with {T152339} and it looked like controller failure [19:21:21] (03Abandoned) 10Andrew Bogott: uwsgi: uwsgi should run as root, not as www-data [puppet] - 10https://gerrit.wikimedia.org/r/330370 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [19:24:29] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1006 - https://phabricator.wikimedia.org/T154418#2917346 (10fgiunchedi) 05Open>03Resolved thanks @Cmjohnson ! disk is rebuilding [19:25:03] RECOVERY - check_ssl on barium is OK: SSL OK - Certificate civicrm.wikimedia.org valid until 2018-02-09 01:41:03 +0000 (expires in 400 days) [19:25:09] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2917350 (10debt) Added a meeting to chat about this with Legal and the Interactive team on Thursday, Jan 5, 2017 [19:26:14] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:27:14] (03PS2) 10Rush: nfs-mounts: Remove wikidata-quality from nfs-mount yaml [puppet] - 10https://gerrit.wikimedia.org/r/330173 (owner: 10Madhuvishy) [19:27:19] (03PS3) 10Rush: nfs: Dual mount misc projects from labstore-secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/329711 (https://phabricator.wikimedia.org/T154336) (owner: 10Madhuvishy) [19:27:22] (03PS2) 10Rush: nfs: Clean up post tools nfs migration [puppet] - 10https://gerrit.wikimedia.org/r/329707 (owner: 10Madhuvishy) [19:28:56] (03PS1) 10Faidon Liambotis: labs: 2>/dev/null dns-floating-ip-updater's output [puppet] - 10https://gerrit.wikimedia.org/r/330453 (https://phabricator.wikimedia.org/T149574) [19:29:14] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2917389 (10Paladox) @Dzahn and @demon setting ?useUnicode=true should stop the 500 errors but emojis won't show for now (db conversion required too) :) [19:30:03] (03CR) 10Faidon Liambotis: [C: 032] labs: 2>/dev/null dns-floating-ip-updater's output [puppet] - 10https://gerrit.wikimedia.org/r/330453 (https://phabricator.wikimedia.org/T149574) (owner: 10Faidon Liambotis) [19:30:13] RECOVERY - check_ssl on thulium is OK: SSL OK - Certificate payments-listener.wikimedia.org valid until 2018-02-09 20:31:03 +0000 (expires in 401 days) [19:30:13] RECOVERY - check_ssl on saiph is OK: SSL OK - Certificate payments-listener.wikimedia.org valid until 2018-02-09 20:31:03 +0000 (expires in 401 days) [19:30:16] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2917395 (10Paladox) https://bugs.mysql.com/bug.php?id=57694 [19:30:25] (03PS2) 10Faidon Liambotis: labs: 2>/dev/null dns-floating-ip-updater's output [puppet] - 10https://gerrit.wikimedia.org/r/330453 
(https://phabricator.wikimedia.org/T149574) [19:30:28] (03PS1) 10Chad: First iteration at making git-fat somewhat legible [debs/git-fat] - 10https://gerrit.wikimedia.org/r/330454 [19:30:30] (03CR) 10Faidon Liambotis: [V: 032 C: 032] labs: 2>/dev/null dns-floating-ip-updater's output [puppet] - 10https://gerrit.wikimedia.org/r/330453 (https://phabricator.wikimedia.org/T149574) (owner: 10Faidon Liambotis) [19:31:20] (03CR) 10Andrew Bogott: [C: 032] Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330371 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [19:31:27] (03PS6) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/330371 (https://phabricator.wikimedia.org/T150774) [19:32:14] 06Operations, 15User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2917405 (10faidon) [19:32:16] 06Operations, 06Labs, 13Patch-For-Review: cronspam from labstores, labcontrol, labstestservices - https://phabricator.wikimedia.org/T149574#2917402 (10faidon) 05Open>03Resolved a:03faidon I don't think there's anything else to be done for this and it seems to have been largely ignored anyway. Resolving. 
[19:32:18] (03Draft1) 10Paladox: Gerrit: Set useUnicode=true [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) [19:32:21] (03Draft2) 10Paladox: Gerrit: Set useUnicode=true [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) [19:33:05] (03PS3) 10Paladox: Gerrit: Set useUnicode=true [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) [19:34:43] RECOVERY - keystone-admin on labtestcontrol2001 is OK: HTTP OK: HTTP/1.0 300 Multiple Choices - 737 bytes in 0.076 second response time [19:35:03] ostriches ^^ :) [19:35:13] RECOVERY - check_ssl on mintaka is OK: SSL OK - Certificate civicrm.wikimedia.org valid until 2018-02-09 01:41:03 +0000 (expires in 400 days) [19:37:33] PROBLEM - keystone http on labtestcontrol2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 78 bytes in 0.072 second response time [19:37:34] (03CR) 10Paladox: [C: 031] "I think we can merge this now?" 
[puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [19:37:43] PROBLEM - keystone-admin on labtestcontrol2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 78 bytes in 0.072 second response time [19:38:33] RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 725 bytes in 0.075 second response time [19:38:43] RECOVERY - keystone-admin on labtestcontrol2001 is OK: HTTP OK: HTTP/1.0 300 Multiple Choices - 737 bytes in 0.079 second response time [19:38:43] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:39:37] 06Operations, 06Labs, 10video2commons: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068#2917434 (10yuvipanda) [19:47:33] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:50:59] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: codfw: elastic2025-elastic2036/switch port configuration - https://phabricator.wikimedia.org/T154605#2917474 (10Papaul) [19:51:26] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: codfw: elastic2025-elastic2036/switch port configuration - https://phabricator.wikimedia.org/T154605#2917474 (10Papaul) [19:52:11] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2905389 (10Papaul) [19:53:18] 06Operations, 10ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#2917500 (10Cmjohnson) okay... i will have to call HP. 
[19:55:14] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [19:55:50] (03CR) 10Thcipriani: [C: 031] "much nicer to read" [debs/git-fat] - 10https://gerrit.wikimedia.org/r/330454 (owner: 10Chad) [19:59:42] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1001 - https://phabricator.wikimedia.org/T154396#2917518 (10fgiunchedi) 05Open>03Resolved thanks @Cmjohnson ! disk is rebuilding [20:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170104T2000). Please do the needful. [20:01:16] * thcipriani does [20:03:33] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:04:39] (03PS1) 10Thcipriani: group1 wikis to 1.29.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330459 [20:04:41] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.29.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330459 (owner: 10Thcipriani) [20:05:20] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330459 (owner: 10Thcipriani) [20:05:24] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~17% of the fleet - https://phabricator.wikimedia.org/T150160#2917549 (10Dzahn) a:03Dzahn [20:05:31] (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330459 (owner: 10Thcipriani) [20:05:57] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.7 [20:07:09] (03PS1) 10Filippo Giunchedi: grafana: add 503 breakdown to varnish-http-errors [puppet] - 10https://gerrit.wikimedia.org/r/330460 [20:08:07] hrm seeing a lot of fatals for: Access level to Wikibase\Repo\Diff\EntityContentDiffView::getRevisionHeader() must be public [20:08:13] PROBLEM - Ensure NFS exports are 
maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [20:08:23] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech [20:08:33] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech [20:08:43] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [20:09:07] that looks like a labs NFS failure on both servers .. uhmm [20:09:13] andrewbogott: ? [20:10:54] added a task for the error https://phabricator.wikimedia.org/T154609 [20:11:00] chasemp ^^ [20:12:16] !log rollback 1.29.0-wmf.7 from group1 for T154609 [20:12:36] mutante: it's probably wikitech having issues? those mechanisms all access silver or the wikitech api afaict [20:12:45] and it seems thcipriani is rolling something back, related? 
[20:12:53] just cause i saw both labstore1001 and 1002 [20:13:01] chasemp: not related as far as I'm aware [20:13:02] yeah wikitech is down or blank [20:13:06] thcipriani: ^ [20:13:12] rolling back now [20:13:15] assume from deploy [20:13:23] yup that'd do it [20:13:37] chasemp: *nod* thanks [20:13:38] also seeing puppet failure alerts on #labs [20:13:57] madhuvishy: that's generally from the wiki puppet ENC backend I imagine [20:14:02] yeah [20:14:33] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: Rollback 1.29.0-wmf.7 from group1 wikis [20:14:33] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (67470 200000s) [20:14:43] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [20:15:03] thcipriani: whatever the reason you rolled back may need to add wikitech thinsg to the list? :) [20:15:06] yea, i didn't mean the sync thing, just the "nfs-exportd is activating" part [20:15:13] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exportd is active [20:15:23] RECOVERY - are wikitech and wt-static in sync on labtestweb2001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (67510 200000s) [20:15:26] chasemp: /me nods [20:15:33] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:15:42] thanks thcipriani [20:17:42] chasemp: madhuvishy I guess we should rewrite nfs-exportsd to use the openstack api instead [20:17:54] yuvipanda: yep [20:17:59] agreed [20:18:02] yuvipanda: I think krenair put a patch up somewhere already [20:18:06] madhuvishy: ^ [20:18:09] aah [20:18:12] I did [20:18:22] It should've shown up in the operations/puppet review queue [20:18:56] (03PS1) 10Thcipriani: Revert "group1 wikis to 1.29.0-wmf.7" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/330463 [20:19:12] most of the stuff needed was there except for python3 versions of the openstack client packages [20:19:20] https://gerrit.wikimedia.org/r/#/c/328609/ [20:19:21] which I left a mail for andrew on the ops list about [20:19:29] (03CR) 10Thcipriani: [C: 032] Revert "group1 wikis to 1.29.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330463 (owner: 10Thcipriani) [20:19:52] also did someone just break beta's puppet? [20:20:07] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.29.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330463 (owner: 10Thcipriani) [20:20:10] Krenair: fallout from wikitech outage I imagine [20:20:16] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.29.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330463 (owner: 10Thcipriani) [20:20:28] well I still see failures [20:20:31] *reads up* [20:20:57] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Reading data from Hosts/deployment-cache-text04 failed: NoMethodError: undefined method `[]' for nil:NilClass at /etc/puppet/manifests/realm.pp:40 on node deployment-cache-text04.deployment-prep.eqiad.wmflabs [20:21:13] hm [20:21:21] that'll be the realm-from-hiera line [20:21:46] (03PS2) 10Chad: First iteration at making git-fat somewhat legible [debs/git-fat] - 10https://gerrit.wikimedia.org/r/330454 [20:21:48] (03PS1) 10Chad: Pull in all upstream changes from https://github.com/jedbrown/git-fat/blob/master/git-fat [debs/git-fat] - 10https://gerrit.wikimedia.org/r/330464 [20:23:00] thcipriani: I'm guessing your going to want that Wikibase thing fixing? :) [20:23:18] addshore: yes please :) [20:23:30] *goes to do it* [20:24:13] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:25:56] oh wtf, what is this [20:26:02] Error: Could not apply complete catalog: Found 1 dependency cycle: [20:26:02] (Exec[acme-setup-acme-beta_wmflabs_org] => Letsencrypt::Cert::Integrated[beta.wmflabs.org] => Service[nginx] => Exec[acme-setup-acme-beta_wmflabs_org]) [20:26:05] anyone seen that before? [20:26:59] thcipriani: what branch of Wikibase is current deployed? [20:27:04] / Wikidata [20:27:11] it only started failing a quarter of an hour ago so ... [20:28:33] addshore: https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/config.json#L171 [20:28:41] 1.29.0-wmf.5 [20:28:45] ahh okay [20:28:55] anybody here who knows scap3 and can help me with some questions? [20:29:27] SMalyshev: I may be able to help. What's up? [20:29:46] thcipriani: so scap3 has a way to generate config files on deploy [20:30:05] thcipriani: is there a way to generate a config file inside deploy directory? [20:31:03] SMalyshev: no, not currently. There is a ticket for that. Right now scap3 can't do un git-tracked things in the deployed directory. [20:32:08] thcipriani: to avoid things getting confusing, how do you feel about me making a branch for both Wikibase and Wikidata called wmf/1.29.0-wmf5.1 including the patch we need? and wmf6 etc already exist with a bunch for changes included... [20:33:17] thcipriani: I see. so I have this situation: I have a deployment repo that I want to use bundled config file when running locally/on labs, but generated config file when used on production. The problem is I can't easily change config file (i.e. it has to be the same path) [20:33:19] otherwise wmf5 of core and wmf5 of wikibase/data would have a similar problem [20:33:55] thcipriani: so is there something I can do with scap3? maybe there's some conditional deployment or post-deploy script or whatever? [20:34:24] so, anyone going to do anything about this? [20:34:30] addshore: oh boy. 
I think that should be fine. [20:34:38] thcipriani: okay, will do! [20:35:50] Krenair: I missed the context, what's up? [20:35:54] ah [20:36:19] SMalyshev: hrm, so you want to overwrite the existing config file that is part of a repo only in production? [20:36:30] thcipriani: basically, yes [20:36:54] !log rolling out exim4 upgrades (DSA 3747-1) on stat* and kubernetes hosts [20:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:58] or, alternatively, dynamically generate it maybe... but that's the same [20:37:46] scap3 can definitely dynamically write a config file as part of a deployment, the part that gets tricky is placing it in the directory as part of deployment since scap3 is swapping symlinks [20:38:05] lemme check when the config file is placed... [20:38:10] thcipriani: exactly, that's where I have problems... [20:38:41] thcipriani: when we tried to just use relative path for config we've got permission error so that doesn't work [20:39:22] !log rolling out exim4 upgrades (DSA 3747-1) on mw-canary hosts [20:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:22] mutante: if you have sec to look at deployment-cache-text04.deployment-prep.eqiad.wmflabs it seems to have manifested as depedency cycle with the Letsencrypt stuff [20:40:35] mutante: has anything changed there recently or? it's super odd it would just crop up [20:42:25] chasemp: i don't know about any change (or even that we use it on deployment-prep), my experience is limited to the setup for Gerrit and RT, but let me see [20:42:29] SMalyshev: so the config files scap3 writes live at [deployment directory]/.git/config-files/[config-file-path] and they get linked to their final location right before the final symlink swap happens where [repo] -> [repo]-cache/revs/[current deployed rev] [20:42:39] is something with wikibase broken? 
[20:43:23] SMalyshev: so the generated config file is present at /srv/deployment/[repo]/.git/config-files/[config-file-name] after the promote stage. [20:44:03] aude: there was a spike in errors after I rolled forward 1.29.0-wmf.7 https://phabricator.wikimedia.org/T154609 [20:44:06] hey aude :) [20:44:17] aude: I'm on it >> https://gerrit.wikimedia.org/r/#/c/330468/1 [20:44:20] if hiera('do_acme', true) { then exec { "acme-setup-acme-${safe_title}": which has require => Service[$puppet_svc], [20:44:40] thcipriani: hmmm that is very specialized path so I'm not sure that helps a lot by itself... scap3 doesn't have any post-deploy script capability? [20:44:51] i'm confused wmf/1.29.0-wmf.5.1 [20:45:15] well, if I update the wmf5 branch then that would be incompatible with the wmf5 branch of core? [20:45:35] SMalyshev: it does. You can run a command after any stage. So you can run a script after the promote stage to modify files in their final locations. [20:45:36] * aude looks to see what we have now [20:45:48] aude: wmf6 seems to be a fair way ahead of 5 [20:46:04] * addshore was going for the minimalistic approach ;) but now your here I'll let you have a look! [20:46:06] SMalyshev: you'd do that with command checks https://doc.wikimedia.org/mw-tools-scap/scap3/checks.html#command-checks [20:46:12] we are wmf.7 core ? [20:46:36] aude: well, we just went wmf7 of core on group1 and rolled back now [20:46:38] chasemp: i am on that instance now, yea, i see the dependency error. so you say it just started to break recently? [20:47:02] or is it different now because something changed about the Hiera side [20:47:03] thcipriani: aha, that may help! 
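The dependency cycle Puppet reported at 20:26 (Exec => Letsencrypt::Cert::Integrated => Service[nginx] => Exec) can be reproduced outside Puppet with a depth-first search. This is a minimal sketch, assuming the three edges quoted in the error message are the whole graph; the real catalog on deployment-cache-text04 has many more resources, and `find_cycle` is a hypothetical helper, not part of Puppet.

```python
from typing import Dict, List, Optional

# Edges copied from the error message in the log; "A => B" is read here as
# "A points at B". Illustrative graph only, not the real catalog.
EDGES: Dict[str, List[str]] = {
    "Exec[acme-setup-acme-beta_wmflabs_org]": [
        "Letsencrypt::Cert::Integrated[beta.wmflabs.org]"
    ],
    "Letsencrypt::Cert::Integrated[beta.wmflabs.org]": ["Service[nginx]"],
    "Service[nginx]": ["Exec[acme-setup-acme-beta_wmflabs_org]"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of nodes (first == last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour: Dict[str, int] = {}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        stack.append(node)
        for nxt in graph.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: that closes a cycle.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for start in graph:
        if colour.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


if __name__ == "__main__":
    cycle = find_cycle(EDGES)
    print(" => ".join(cycle) if cycle else "no cycle")
```

Removing any one of the three edges makes `find_cycle` return `None`, which is why a single added `require` can surface as a cycle only once a host finally receives the new manifests.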
[20:47:19] i really wasnt involved in the nginx setup there [20:47:21] i think wmf.5 wikibase but not sure [20:47:24] mutante: Krenair noticed it 30m or so ago, it's possible it had been hiding out possibly [20:47:33] Aude, yes wmf5 wikibase [20:47:36] mutante: yeah me neither, I haven't actually seen this code before [20:47:47] so i know the "acme" stuff, that would be ok [20:47:53] it works in production to use that to get LE certs [20:47:55] the dependencies shown don't appear to match up with the code [20:47:58] with Apache [20:48:03] SMalyshev: see https://phabricator.wikimedia.org/diffusion/LSTD/browse/master/scap/checks/virtualenv.sh for an example of abusing the checks system in scap3 to do post-deploy tasks [20:48:07] but thought hoo deployed the week after [20:48:07] SMalyshev: both striker and ORES use those hooks in interesting ways, so those examples may be helpful https://github.com/wikimedia/labs-striker-deploy/blob/master/scap/checks.yaml [20:48:08] the part i dont know is the nginx setup here [20:48:10] thcipriani: so which stage I should specify if I want it to happen after all symlinks etc. are done? promote? [20:48:15] heh, yeah, what bd808 said [20:48:24] yup promote [20:48:31] mutante, feel free to log into deployment-cache-text04 and explore there [20:48:48] SMalyshev: the control file that fires that script is https://phabricator.wikimedia.org/diffusion/LSTD/browse/master/scap/checks.yaml [20:49:00] !log adding temporary IP tables rule on labservices1001 to drop traffic from toolchecker for tests (T152369) [20:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:03] T152369: toolschecker fell to pieces when labs-ns0 went down - https://phabricator.wikimedia.org/T152369 [20:49:32] bd808: thank you! 
[20:50:23] mutante: my assumption is that their is a require on the Letsencrypt::Cert::Integrated from wherever $puppet_svc is defeind for this [20:50:28] but why it would show up today [20:50:31] bd808: I see DEPLOY_DIR is hardcoded there - scap does not tell it to you? [20:51:12] aude: let me know the plan of action! ;) [20:51:22] SMalyshev: good question. I don't know if it puts anything in the environment or not. [20:51:38] I would guess not but would love to be wrong [20:52:06] bd808: ok... will hardcode for now :) [20:52:19] it probably wouldn't be too hard to add or to derive from the script file location itself [20:52:24] trying to see what was actually deployed because release tools still says wmf.5 wikibase [20:53:04] scap does not give you a DEPLOY_DIR in the env of a check, although that would be an excellent feature :) [20:53:13] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:53:31] stashbot: hello? [20:53:31] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [20:53:41] i think wmf.5 is good [20:53:42] https://phabricator.wikimedia.org/diffusion/MW/browse/wmf%252F1.29.0-wmf.7/.gitmodules [20:53:47] https://phabricator.wikimedia.org/diffusion/MW/browse/wmf%252F1.29.0-wmf.6/.gitmodules [20:55:03] (03PS1) 10Hashar: build: update rubocop to 0.37 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 [20:58:54] (03PS2) 10Hashar: build: update rubocop to 0.39 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 [20:59:18] (03CR) 10Hashar: "PS2 bumps rubocop to 0.39 :]" [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [20:59:50] aude: will resubmit to wmf5 of wikibase then, but still in my mind that means wmf5 of core and wmf5 of wikibase together would be broken! [20:59:52] bd808: I don't know how to chatops. But I made a task. 
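Since scap does not put a DEPLOY_DIR into the environment of a check, the suggestion above to derive it from the script's own location could look roughly like this. It is a sketch under assumptions: the check script is assumed to live at `<repo>/scap/checks/<script>`, as in the linked striker example, and `deploy_dir_from_script` is a hypothetical helper name, not a scap3 API.

```python
from pathlib import Path


def deploy_dir_from_script(script_path: str, depth: int = 2) -> Path:
    """Walk up from the script to the repository root.

    With the assumed layout <repo>/scap/checks/<script>:
      parents[0] -> <repo>/scap/checks
      parents[1] -> <repo>/scap
      parents[2] -> <repo>
    resolve() follows symlinks, so on a host where <repo> is a symlink to a
    revision directory this returns the revision directory, not the link.
    """
    return Path(script_path).resolve().parents[depth]


if __name__ == "__main__":
    # Inside a real check script this would be called with __file__.
    print(deploy_dir_from_script(__file__))
```

Whether the resolved revision directory or the unresolved symlink is the right thing to write into depends on when the check runs relative to the symlink swap; the log only establishes that `promote` is the stage to hook.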
[21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170104T2100). [21:00:07] addshore: then wmf.7 [21:00:17] ? [21:00:36] !log smalyshev@tin Starting deploy [wdqs/wdqs@3762556]: (no message) [21:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:58] not sure [21:01:13] thcipriani: heh. I've thought of making a !task command for stashbot but then decided that it would be too likely to make noise in phab [21:01:16] aude: well, for wmf7 if your sure we can deploy that now then sure! :P [21:01:29] Again, as you and hoo wern't here I was just going for a bare minimum fix ;) [21:01:29] addshore: it's different than wmf5 [21:01:35] !log smalyshev@tin Finished deploy [wdqs/wdqs@3762556]: (no message) (duration: 00m 59s) [21:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:38] i think what happened is hoo created a new branch but it never was deployed [21:01:47] that method is already public in wmf6/7 of wikibase [21:01:48] bd808: tasks at the speed of IRC are a dangerous thing. [21:01:58] wmf5.1 is fine [21:03:03] thcipriani: code review over irc is best tho [21:03:07] aude: okay, so I'll +2 on wmf5.1 and then make a new build also for the 5.1 branch of Wikidata? :) [21:03:46] gave +@ [21:03:51] epic! :) [21:04:10] if you can make the new build to include wmf.5.1 wikibase that would be good [21:04:21] +2 [21:04:45] Yup, I'll give it a shot once that merges! :) [21:05:01] ostriches: definitely [21:06:47] thanks [21:08:20] bah, aude how do I do the "grant branch" / composer update for only the single package? [21:08:33] We should really include that in the build-resources readme! [21:08:45] composer update --prefer-dist -o --no-dev wikibase/wikibase [21:08:50] epic! 
:) [21:10:32] thcipriani: is there a change on wikitech today that was not reverted already? [21:10:59] mutante: wikitech should be back to how it started the day [21:11:13] we are seeing errors on instances that look like they try Hiera lookups from Wikitech and it fails [21:11:19] but i dont really know [21:11:29] Error 400 on SERVER: Reading data from Hosts/deployment-cache-text04 failed: NoMethodError: undefined method `[]' for nil:NilClass at /etc/puppet/manifests/realm.pp:40 [21:12:21] hrm, I rolled it forward to wmf.7 now back on wmf.6 https://wikitech.wikimedia.org/wiki/Special:Version [21:12:38] "Reading data from Hosts/..." is that Hiera ? [21:12:43] the special page, i mean [21:12:52] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2917759 (10Aklapper) >>! In T145885#2916586, @demon wrote: > this was filed in context of emojis, but really it's about all extended unicode that my... 
[21:13:04] !log smalyshev@tin Starting deploy [wdqs/wdqs@3762556]: (no message) [21:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:12] stuff like https://wikitech.wikimedia.org/wiki/Hiera:Staging/host/thcipriani-mediawiki is [21:13:55] 39 if $realm == undef { [21:13:55] 40 $realm = hiera('realm', 'production') [21:14:09] it says at line 40 [21:14:18] so it thinks $realm is undef [21:14:19] uhm [21:14:31] no good [21:14:55] well it tries to lookup what the realm is , and can't [21:15:03] and prod is default [21:15:44] !log smalyshev@tin Finished deploy [wdqs/wdqs@3762556]: (no message) (duration: 02m 40s) [21:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:44] thcipriani: aude https://gerrit.wikimedia.org/r/#/c/330479/1 just waiting for that to merge and then wmf7 could be deployed with wmf5.1 of the Wikidata extension [21:19:08] addshore: awesome, thanks :) [21:19:40] thanks addshore :) [21:21:49] !log bsitzmann@tin Starting deploy [mobileapps/deploy@c39bd1f]: Update mobileapps to b43c5d6 [21:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:28] we are trying to restart the deployment puppetmaster in case it's caching for the mw hiera backend [21:24:44] !log bsitzmann@tin Finished deploy [mobileapps/deploy@c39bd1f]: Update mobileapps to b43c5d6 (duration: 02m 55s) [21:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:33] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Package[sysstat],Package[lldpd],Package[ncdu],Package[dstat] [21:34:43] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:37:02] thcipriani: wmf/1.29.0-wmf.5.1 of Wikidata all ready ;) [21:37:16] addshore: nice! Thank you! 
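The `undefined method `[]' for nil:NilClass` error is what indexing into a missing result looks like in Ruby. The sketch below shows the same failure mode and a defensive variant in Python. It is an illustration only: `fetch` stands in for whatever reads the `Hosts/...` page from wikitech, and neither function is the actual Hiera backend code.

```python
from typing import Any, Callable, Mapping, Optional

Fetch = Callable[[str], Optional[Mapping[str, Any]]]


def lookup_naive(fetch: Fetch, page: str, key: str) -> Any:
    # Indexes straight into the result; raises TypeError if fetch() gave
    # None (the Python analogue of calling [] on nil).
    return fetch(page)[key]


def lookup(fetch: Fetch, page: str, key: str, default: Any) -> Any:
    """Return data[key], or default when the page or the key is missing."""
    data = fetch(page)
    if data is None:  # the backend returned nothing, e.g. a blank page
        return default
    return data.get(key, default)


if __name__ == "__main__":
    def blank_wiki(page: str) -> Optional[Mapping[str, Any]]:
        return None  # simulates the wiki serving no data

    print(lookup(blank_wiki, "Hosts/deployment-cache-text04", "realm", "production"))
```

Falling back quietly is not obviously better here: on a labs instance a default of `production` would be the wrong realm, so failing loudly, as the log shows, may be the safer behaviour.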
[21:38:06] I'm still having trouble finding any errors from silver that would explain a blank page on wikitech due to wmf.7 though and I don't think it was the same root issue:(( [21:38:27] oooh [21:39:30] mw errors flatline when I deployed so it, I guess, was hhvm stuff https://logstash.wikimedia.org/goto/e32ffdeb83bc01febf7c4ee82c472e8f [21:39:53] but the wikidata error was the only new thing I saw in fatal monitor [21:40:00] (03PS1) 10Andrew Bogott: Keystone: puppetize logging.conf [puppet] - 10https://gerrit.wikimedia.org/r/330566 [21:40:24] https://gerrit.wikimedia.org/r/#/c/327465/2 apparently caused the dependency issue on deployment-cache [21:41:19] but it did not happen before because ... deployment-puppetmaster not having the same code as prod master... we dont know [21:41:44] 13:40 < Krenair> reflog shows a rebase today along with references to https://gerrit.wikimedia.org/r/#/c/312523/ [21:42:28] I think a blank page on wikitech would explain the other error we were seeing [21:42:33] 06Operations, 06Analytics-Kanban, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2896288 (10Nuria) Also here: https://issues.apache.org/jira/browse/HADOOP-11105 regarding class: org.apache.hadoop.metrics2.impl.MetricsSystemImpl which is retaining a lot of memory... [21:42:45] still not sure how this letsencrypt dependency kept hidden for so long [21:43:16] yes, exactly [21:43:38] one error was the Hiera thing, but this dependency change is different [21:43:57] so i assume deployment-puppetmaster was out of date ? [21:44:04] and then it was fixed.. 
getting the new code [21:44:17] maybe [21:44:18] but the thing is [21:44:24] but /usr/local/bin/git-sync-upstream has logs showing it ran today at 06:30AM and had nothing to update [21:45:11] I got the puppet failure mail at 20:58 [21:45:36] (03CR) 10Andrew Bogott: [C: 032] Keystone: puppetize logging.conf [puppet] - 10https://gerrit.wikimedia.org/r/330566 (owner: 10Andrew Bogott) [21:46:29] https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1483566379.519&target=deployment-prep.deployment-cache-text04.puppetagent.failed_events&from=00%3A00_20161214&until=23%3A59_20170104 [21:46:44] !log otto@tin Starting deploy [eventstreams/deploy@a103be2]: (no message) [21:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:01] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2917868 (10Dzahn) [21:48:06] (03PS1) 10Andrew Bogott: Keystone: Add service notifications for a couple of files [puppet] - 10https://gerrit.wikimedia.org/r/330578 [21:48:15] hrm, maybe some things needed to be restarted before the error surfaced? Machine state change [21:48:16] !log otto@tin Finished deploy [eventstreams/deploy@a103be2]: (no message) (duration: 01m 32s) [21:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:10] hrm. 
there are 0 log events after I move wmf.7 to wikitech there are errors if you move that time window a bit in either direction, but while wikitech was serving a blank page it doesn't seem to have sent anything to logstash either https://logstash.wikimedia.org/goto/799c3d7ed1a86b535206cffa83094509 [21:50:38] makes it a bit hard to troubleshoot :( [21:50:59] mutante, there are a *lot* of puppet changes [21:51:06] which makes it seem like maybe puppet wasn't running [21:51:43] PROBLEM - keystone process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/keystone-all [21:52:03] PROBLEM - keystone http on labcontrol1001 is CRITICAL: connect to address 208.80.154.92 and port 5000: Connection refused [21:52:55] 06Operations, 10MediaWiki-Internationalization: Norwegian messages inContentLanguage look for on-wiki overrides at the /nb subpage, not the root page - https://phabricator.wikimedia.org/T126146#2917889 (10Krinkle) 05Open>03Resolved a:03Krinkle [21:53:03] RECOVERY - keystone http on labcontrol1001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 725 bytes in 0.003 second response time [21:53:09] Krenair: yes, so it was probably broken since 3 weeks [21:53:13] yeah [21:53:22] It got: [21:53:27] * A change in dhparam.pem [21:53:28] Krenair: i would expect shinken to tell us .. but ... [21:53:33] does it have history [21:53:44] * Several new ntp restrict lines [21:54:00] * Removed some varnish varnishprocessor file [21:54:11] * created a varnish cachestats.py file [21:54:23] * various other varnish things [21:54:33] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 17 seconds ago with 2 failures. 
Failed resources (up to 3 shown): Service[uwsgi-keystone-admin],Service[uwsgi-keystone-public] [21:55:08] !log mwscript deleteEqualMessages.php --wiki nowiki (T45917) [21:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:12] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [21:55:15] !log mwscript deleteEqualMessages.php --wiki nowikinews (T45917) [21:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:28] * New root: bd808 [21:57:27] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [21:59:37] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:02:37] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [22:03:14] (03PS1) 10Andrew Bogott: Keystone: rotate uwsgi logs [puppet] - 10https://gerrit.wikimedia.org/r/330579 (https://phabricator.wikimedia.org/T150774) [22:04:11] (03CR) 10Andrew Bogott: [C: 032] Keystone: Add service notifications for a couple of files [puppet] - 10https://gerrit.wikimedia.org/r/330578 (owner: 10Andrew Bogott) [22:05:58] anyone with any wikitech knowledge able to shine any light on https://phabricator.wikimedia.org/T154618 ? Somehow 1.29.0-wmf.7 hopelessly broke wikitech in a way that no other group1 wikis broke and there is nothing in logstash :( [22:06:02] (03CR) 10Andrew Bogott: [C: 032] Keystone: rotate uwsgi logs [puppet] - 10https://gerrit.wikimedia.org/r/330579 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [22:09:04] (03CR) 10Filippo Giunchedi: [C: 04-1] "Agreed on naming bikeshedding. 
We'd need to go with sth like restbase-dev and rename the hosts too to keep things consistent" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: 10Eevans) [22:11:59] (03PS1) 10Andrew Bogott: Keystone: Make /var/log/keystone group writeable [puppet] - 10https://gerrit.wikimedia.org/r/330582 (https://phabricator.wikimedia.org/T150774) [22:13:41] (03CR) 10Andrew Bogott: [C: 032] Keystone: Make /var/log/keystone group writeable [puppet] - 10https://gerrit.wikimedia.org/r/330582 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [22:17:56] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2918014 (10RobH) [22:17:59] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: codfw: elastic2025-elastic2036/switch port configuration - https://phabricator.wikimedia.org/T154605#2918012 (10RobH) 05Open>03Resolved ports setup [22:19:58] !log rolling out exim4 upgrades (DSA 3747-1) on mw-codfw [22:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:37] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2918030 (10fgiunchedi) [22:23:40] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test1001 fourth ssd not detected - https://phabricator.wikimedia.org/T153139#2918028 (10fgiunchedi) 05Open>03Resolved Given the comment in https://phabricator.wikimedia.org/T151075#2873409 all disks are now detected, resolving [22:24:54] thcipriani: that should have been fixed before... the SemanticForms rename blew stuff up when ostriches was running a previous train. Maybe the branch cut script didn't get updated properly? 
[22:25:57] bd808: I got to the root of it https://phabricator.wikimedia.org/T154618 semanticforms just has a textfile in the extensions/SemanticForms dir. Figuring out the right way to update now :( [22:26:37] (03PS3) 10BBlack: unified cert: infra for per-dc switching [puppet] - 10https://gerrit.wikimedia.org/r/330437 [22:26:39] (03PS3) 10BBlack: unified cert: use digicert in esams [puppet] - 10https://gerrit.wikimedia.org/r/330438 [22:26:47] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2918041 (10fgiunchedi) [22:26:48] branching script was pointed at a tag not a branch and did the wrong thing, obviously. [22:26:48] mostly I thought we decided not to follow master for it [22:27:22] (03CR) 10BryanDavis: [C: 031] Include DB shard as a logstash column [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328618 (owner: 10Aaron Schulz) [22:27:34] (03PS1) 10Filippo Giunchedi: Rename restbase-test1* to restbase-dev1* [dns] - 10https://gerrit.wikimedia.org/r/330584 (https://phabricator.wikimedia.org/T151075) [22:28:04] (03PS1) 10Andrew Bogott: Keystone: Purge eventlet settings from config [puppet] - 10https://gerrit.wikimedia.org/r/330585 (https://phabricator.wikimedia.org/T150774) [22:30:48] (03CR) 10Andrew Bogott: [C: 032] Keystone: Purge eventlet settings from config [puppet] - 10https://gerrit.wikimedia.org/r/330585 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [22:31:10] (03PS1) 10Filippo Giunchedi: install_server: rename restbase-test1* to restbase-dev1* [puppet] - 10https://gerrit.wikimedia.org/r/330588 (https://phabricator.wikimedia.org/T151075) [22:31:48] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2918062 (10RobH) [22:33:44] (03PS4) 10BBlack: unified cert: infra for per-dc switching [puppet] - 
https://gerrit.wikimedia.org/r/330437 [22:34:04] (CR) BBlack: [V: 2 C: 2] unified cert: infra for per-dc switching [puppet] - https://gerrit.wikimedia.org/r/330437 (owner: BBlack) [22:35:32] (PS4) BBlack: unified cert: use digicert in esams [puppet] - https://gerrit.wikimedia.org/r/330438 [22:38:24] (CR) Volans: "Thanks for all the fixes, looks much better now. Some replies to your comments inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: Mobrovac) [22:38:47] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:39:38] (CR) BBlack: [V: 2 C: 2] unified cert: use digicert in esams [puppet] - https://gerrit.wikimedia.org/r/330438 (owner: BBlack) [22:39:43] (CR) Filippo Giunchedi: [C: 2] Rename restbase-test1* to restbase-dev1* [dns] - https://gerrit.wikimedia.org/r/330584 (https://phabricator.wikimedia.org/T151075) (owner: Filippo Giunchedi) [22:40:17] (CR) Filippo Giunchedi: [C: 2] install_server: rename restbase-test1* to restbase-dev1* [puppet] - https://gerrit.wikimedia.org/r/330588 (https://phabricator.wikimedia.org/T151075) (owner: Filippo Giunchedi) [22:40:23] (PS2) Filippo Giunchedi: install_server: rename restbase-test1* to restbase-dev1* [puppet] - https://gerrit.wikimedia.org/r/330588 (https://phabricator.wikimedia.org/T151075) [22:41:45] (CR) Filippo Giunchedi: [V: 2 C: 2] install_server: rename restbase-test1* to restbase-dev1* [puppet] - https://gerrit.wikimedia.org/r/330588 (https://phabricator.wikimedia.org/T151075) (owner: Filippo Giunchedi) [22:43:57] aude: addshore if either of you folks are still around could you doublecheck me on: https://gerrit.wikimedia.org/r/#/c/330590/ [22:44:40] (PS1) Filippo Giunchedi: install_server: rename restbase-test1* to restbase-dev1* [puppet] - 
https://gerrit.wikimedia.org/r/330592 (https://phabricator.wikimedia.org/T151075) [22:45:28] (PS2) Filippo Giunchedi: install_server: rename restbase-test1* to restbase-dev1* [puppet] - https://gerrit.wikimedia.org/r/330592 (https://phabricator.wikimedia.org/T151075) [22:45:36] (CR) Filippo Giunchedi: [V: 2 C: 2] install_server: rename restbase-test1* to restbase-dev1* [puppet] - https://gerrit.wikimedia.org/r/330592 (https://phabricator.wikimedia.org/T151075) (owner: Filippo Giunchedi) [22:46:01] thcipriani: on my phone, the gitmodule file looks fine [22:46:24] thcipriani: as long as that new hash is the hash for that patch I made then all should be good! (: [22:46:28] addshore: ok, thanks, will merge :) [22:46:46] thcipriani: feel free to ping me for anything else :) [22:46:53] !log TLS: unified certificates in esams switching to digicert [22:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:11] Operations, Analytics, ChangeProp, EventBus, and 5 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2918168 (cscott) FWIW, the offline content generation service (OCG, generates PDFs, ZIM files, books, etc) also has a bespoke jo... 
[22:48:45] (03CR) 10Eevans: [WIP]: Enable Cassandra on restbase-test100[1-3] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: 10Eevans) [22:49:09] !log rename / reimage restbase-test1* to restbase-dev1* [22:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:53] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2075.codfw.wmnet [22:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:58] (03PS3) 10Eevans: [WIP]: Enable Cassandra on restbase-dev100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) [22:51:02] * robh starts killing servers [22:51:04] weeeee [22:51:25] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2076.codfw.wmnet [22:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:38] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2077.codfw.wmnet [22:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:51] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2078.codfw.wmnet [22:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:14] !log all my server depools and decoms for the mw range are on T154621 [22:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:18] T154621: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621 [22:53:20] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2918179 (10RobH) depooled mw2075-2079, will get to the rest post-meeting. [22:54:57] (03CR) 10Chad: [C: 031] "If this works as a stopgap to make the characters insert as ?? 
instead of just erroring out, I think that's a good enough solution for now" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [22:56:05] (03CR) 10Paladox: "> If this works as a stopgap to make the characters insert as ??" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [22:57:14] (03CR) 10Paladox: "Can probably do this during the gerrit maint as this requires a restart." [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [23:02:50] (03PS4) 10Eevans: [WIP]: Enable Cassandra on restbase-dev100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) [23:06:42] !log rolling out exim4 upgrades (DSA 3747-1) on mw-eqiad [23:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:47] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [23:08:02] (03PS5) 10Eevans: [WIP]: Enable Cassandra on restbase-dev100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) [23:08:26] (03PS2) 10Aaron Schulz: Include DB shard as a logstash column [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328618 [23:09:27] (03CR) 10Aaron Schulz: [C: 032] Include DB shard as a logstash column [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328618 (owner: 10Aaron Schulz) [23:09:31] !log rolling out exim4 upgrades (DSA 3747-1) on memcached-canary, memcached-codfw, restbase-codfw [23:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:50] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2758050 (10cscott) Note that 
Parsoid also loads a bunch of configuration from mediawiki at star... [23:09:54] (03CR) 10Paladox: "Probably want to convert the db at the same time just to be on the safe side." [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [23:10:08] (03Merged) 10jenkins-bot: Include DB shard as a logstash column [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328618 (owner: 10Aaron Schulz) [23:10:19] (03CR) 10jenkins-bot: Include DB shard as a logstash column [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328618 (owner: 10Aaron Schulz) [23:11:41] !log aaron@tin Synchronized wmf-config/logging.php: Include DB shard as a logstash column (duration: 00m 41s) [23:11:56] PHP fatal error: [23:11:56] Argument 1 passed to __invoke() must be an instance of array, string given [23:12:03] AaronSchulz: did you just break everything? :( [23:12:09] (i see that in production) [23:12:38] ...can confirm, prod's broken [23:12:42] greg-g: ^ [23:13:17] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.020 second response time [23:13:17] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.021 second response time [23:13:17] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.022 second response time [23:13:17] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50573 bytes in 0.042 second response time [23:13:18] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.045 second response time [23:13:18] Yeah the whole site is down [23:13:20] haha - the error message currently displayed in prod points people to #wikipedia [23:13:24] so it's filling up [23:13:27] PROBLEM - HHVM rendering on mw1264 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.020 second response time [23:13:27] PROBLEM - HHVM rendering on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.022 second response time [23:13:27] PROBLEM - HHVM rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.021 second response time [23:13:27] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50573 bytes in 0.025 second response time [23:13:27] PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.020 second response time [23:13:27] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50525 bytes in 0.019 second response time [23:13:27] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.024 second response time [23:13:36] AaronSchulz: You just broke everything it looks like? 
[23:13:37] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50525 bytes in 0.019 second response time [23:13:37] PROBLEM - Apache HTTP on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.023 second response time [23:13:37] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50525 bytes in 0.018 second response time [23:13:38] PROBLEM - HHVM rendering on mw1211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.024 second response time [23:13:38] PROBLEM - Apache HTTP on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50521 bytes in 0.032 second response time [23:13:47] PROBLEM - Nginx local proxy to apache on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50562 bytes in 0.019 second response time [23:13:47] PROBLEM - HHVM rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.021 second response time [23:13:47] PROBLEM - HHVM rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50520 bytes in 0.021 second response time [23:13:47] PROBLEM - Nginx local proxy to apache on mw1298 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50539 bytes in 0.027 second response time [23:13:47] PROBLEM - Nginx local proxy to apache on mw1199 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50562 bytes in 0.031 second response time [23:13:54] How did the canary thing not find this? [23:13:58] RoanKattouw: please just revert [23:14:07] you know it's good when the error-reporting bot gets flood-kicked [23:14:22] Working on it [23:14:36] RoanKattouw: ack, I'm here let me know how I can help [23:14:36] pages are coming in fyi [23:14:49] Revert on tin, sync-file, then worry about gerrit [23:14:51] RoanKattouw: did you revert now? 
[23:15:00] Doing it now [23:15:16] just noticed that browsing VP/T [23:15:35] https://www.irccloud.com/pastebin/ZpzgcIOk/ [23:15:40] Holy cow so many pages! This is… the thing you're reverting, I take it? [23:15:46] Yes I'm reverting [23:15:53] And it's waiting for canaries [23:15:55] !log catrope@tin Synchronized wmf-config/logging.php: revert (duration: 00m 41s) [23:16:00] How the hell did the canaries not catch these errors? [23:16:07] RoanKattouw: nice [23:16:09] RECOVERY - HHVM rendering on mw1236 is OK: HTTP OK: HTTP/1.1 200 OK - 70999 bytes in 0.068 second response time [23:16:10] RECOVERY - HHVM rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 70999 bytes in 0.096 second response time [23:16:10] RECOVERY - HHVM rendering on mw2228 is OK: HTTP OK: HTTP/1.1 200 OK - 70991 bytes in 0.268 second response time [23:16:10] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 70991 bytes in 0.282 second response time [23:16:10] RECOVERY - Nginx local proxy to apache on mw2190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.192 second response time [23:16:10] RECOVERY - Nginx local proxy to apache on mw2213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.191 second response time [23:16:10] RECOVERY - HHVM rendering on mw2225 is OK: HTTP OK: HTTP/1.1 200 OK - 70989 bytes in 0.245 second response time [23:16:14] Back. [23:16:35] RoanKattouw: is there a canary bypass btw? I thought there was a --force or something. [23:16:46] RoanKattouw: probably because they were hhvm errors and not mediawiki errors. hhvm errors would cause a lot of false positives so they're not taken into account. Either that or not enough traffic [23:16:47] ostriches: I gotta go, but could you follow up on two things? 
1) How did canaries miss this, 2) why does scap sync-file give me that ImportError above [23:16:56] --force will bypass the canaries [23:16:58] I'm seeing recoveries on pages too RoanKattouw fyi and ostriches [23:17:10] the import error is harmless and known [23:17:23] !log rolling out exim4 upgrades (DSA 3747-1) on memcached-canary, memcached-codfw, restbase-codfw (for completeness, this was before the unrelated outage) [23:18:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:18:58] AaronSchulz: I suppose so, but it takes more time to find the docs for that than to just wait it out [23:19:43] bd808: that should be "return $record". That seems VERY familiar...like something that was fixed in a PS ages ago. Maybe the push failed or something. [23:19:48] mutante: stashbot wasn't here, !log it again [23:19:55] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:20:24] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [23:20:39] !log rolled out exim4 upgrades (DSA 3747-1) on memcached-canary, memcached-codfw, restbase-codfw, cp-codfw [23:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:43] MatmaRex: thx [23:21:34] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:22:36] bd808: I see loads of packfile errors in my local git repo...time to nuke/rebuild [23:22:44] AaronSchulz: crap. I did not read your patch closely. 
you are adding a closure that acts as a Monolog filter and yes it needs to return the full record and not just a random string [23:22:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [23:23:51] AaronSchulz: you want something like lines 87-88 in your new filter [23:23:55] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:24:13] bd808: yeah, I thought that was fixed already. Very "deja vu". [23:24:27] the import error is harmless and known [23:24:34] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [23:24:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [23:24:36] I don't think it's really fair to describe bugs like that as harmless. [23:24:40] 06Operations: Production error message points users to donate link, that is likely to also produce the same error message - https://phabricator.wikimedia.org/T154627#2918235 (10MC8) [23:24:44] Yvette: Unrelated. [23:24:49] He meant within scap [23:24:51] why does esams keep going off? [23:24:53] It's warning about an Import [23:25:07] ostriches: Right, and it clearly causes confusion. [23:25:20] https://phabricator.wikimedia.org/P4701 [23:25:23] paladox: it wasn't local to esams, it was all [23:25:24] It's still harmless, and unrelated to what just happened. [23:25:33] Oh [23:25:38] Full context is: "as you're reverting, ignore that warning" [23:25:46] (For which, I'll add, I already have a fix) [23:25:48] Who said it was related? [23:26:10] You're the one who brought it up and said it's not harmless [23:26:11] When it is :) [23:26:21] I mean that it caused confusion for Roan. [23:26:25] And probably others. [23:26:39] Saying "oh, that exception is just one we throw all the time" doesn't mean it's harmless. 
[23:26:45] we should probably just remove scap branch from mediawiki config [23:26:45] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:46] It causes harm if it causes confusion. [23:26:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:27:03] We have different opinions of the word harm then [23:27:04] because it is a dumb error [23:27:15] thcipriani: Or, you know, roll out 3.5.x [23:27:26] why not both? :) [23:27:36] I wouldn't quite go as far as Yvette to say that it isn't harmless, but leaving exceptions around that are known to be not a problem is not OK [23:27:45] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 70982 bytes in 4.083 second response time [23:27:48] RoanKattouw: Yes, which I said I already have a fix for [23:27:52] It's in the new scap version [23:27:55] Cool. [23:28:25] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:28:30] RoanKattouw: I'll make a follow up and revert to the remote branch...as soon as finishes and I can do that [23:28:45] PROBLEM - Nginx local proxy to apache on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:29:25] what's up with mw1201? [23:29:35] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:29:45] RECOVERY - Nginx local proxy to apache on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 8.442 second response time [23:29:55] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:30:05] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:30:11] restbase-dev is me [23:30:15] test race I guess to get the mixed recovery and failure [23:30:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:30:35] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:30:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:36:05] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [23:36:15] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (blocked): setup/install restbase-dev100[123] - https://phabricator.wikimedia.org/T151075#2918252 (10fgiunchedi) [23:37:04] I'm going to hold finishing the train until tomorrow. It's already running way late and we just had an outage. At this point group0 is 1.29.0-wmf.7 all other wikis are 1.29.0-wmf.6 [23:38:18] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2918257 (10GWicke) [23:38:34] 06Operations, 10ops-eqiad: Rename/relabel restbase-test1* to restbase-dev1* - https://phabricator.wikimedia.org/T154629#2918272 (10fgiunchedi) [23:39:31] AaronSchulz: fyi: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/136290/console [23:40:35] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2918289 (10GWicke) With T152221 resolved it seems that only T152220 remains to be done until we can call this done. @Esanders, @mobrovac, @Jdforrester-WMF, could you t... 
[23:42:25] (03PS1) 10Volans: Keyholder: add dummy keys for Cumin [labs/private] - 10https://gerrit.wikimedia.org/r/330600 (https://phabricator.wikimedia.org/T154588) [23:44:11] (03PS1) 10Mattflaschen: Disable NewUserMessage gomwiki to prevent corruptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330601 (https://phabricator.wikimedia.org/T131957) [23:44:14] (03PS1) 10Aaron Schulz: Fix "shard" filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330602 [23:44:53] (03PS2) 10Aaron Schulz: Fix "shard" logging processor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330602 [23:47:00] AaronSchulz: beta updates are broken due to that [23:47:01] I'd rather us do an outright revert than try to monkey patch it forward [23:47:05] (03CR) 10Catrope: [C: 031] Disable NewUserMessage gomwiki to prevent corruptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330601 (https://phabricator.wikimedia.org/T131957) (owner: 10Mattflaschen) [23:47:13] yes, please just revert [23:47:24] (because right now git & deploy master are out of sync) [23:47:31] Then start a fresh patch, test, and roll out [23:49:31] (03PS1) 10Aaron Schulz: Revert "Include DB shard as a logstash column" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330606 [23:49:44] (03CR) 10Aaron Schulz: [C: 032] "Cleaning branch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330606 (owner: 10Aaron Schulz) [23:50:27] (03PS1) 10Andrew Bogott: Nova: Add identity_uri config setting [puppet] - 10https://gerrit.wikimedia.org/r/330607 (https://phabricator.wikimedia.org/T150776) [23:50:54] (03Merged) 10jenkins-bot: Revert "Include DB shard as a logstash column" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330606 (owner: 10Aaron Schulz) [23:53:12] i depooled a bunch of systems, went into meeting, pager storm, heh [23:53:24] then saw it was eqiad and i was in codfw systems heh [23:53:30] (03PS1) 10Filippo Giunchedi: Allocate instances for restbase-dev1* [dns] - 
10https://gerrit.wikimedia.org/r/330609 (https://phabricator.wikimedia.org/T153880) [23:54:17] gerrit is down? [23:54:28] ostriches: ^ [23:54:34] Yeah, for me too. Pingable but not ssh or web, just in time for my SWAT. ^ robh [23:55:12] !log krypton - chown Debian-exim:Debian-exim /var/spool/exim4/scan/ to fix Icinga-reported DISK issue - wrong permissions - see puppet/modules/exim4/manifests/init.pp line 57 ff "catch-22 with Puppet vs. package" [23:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:53] (03CR) 10jenkins-bot: Revert "Include DB shard as a logstash column" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330606 (owner: 10Aaron Schulz) [23:55:56] looks like gerrit is back just now [23:56:01] gerrit is back [23:56:04] one of the slowdowns [23:56:16] All I had to do was think about it, and it comes back! [23:56:22] That neural circuit is working wonders [23:56:26] see T148478 [23:56:26] T148478: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478 [23:58:14] I can do SWAT. It might just be me. [23:59:26] (03CR) 10Volans: [V: 032 C: 032] "Dummy keys for keyholder" [labs/private] - 10https://gerrit.wikimedia.org/r/330600 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
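Note on the 23:11 fatal: per the explanation given at 23:22:44, the new closure was acting as a Monolog processor and returned a string instead of the full record, so the next step in the chain received a string where it expected an array. The production code is PHP and does not appear in this log. What follows is only a minimal Python sketch of the same pattern, not the actual patch; every name in it (`add_shard_broken`, `add_shard_fixed`, `add_host`, `run_processors`) and the values `s1` and `example-host` are hypothetical and chosen for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Processor = Callable[[Record], Record]


def add_shard_broken(record: Record) -> Record:
    # Bug pattern (hypothetical): returns only the new value,
    # not the whole record, so the chain loses the record.
    return "s1"  # type: ignore[return-value]


def add_shard_fixed(record: Record) -> Record:
    # Add the extra field, then hand the full record to the next processor.
    record = dict(record)
    record["shard"] = "s1"
    return record


def add_host(record: Record) -> Record:
    # A later processor (hypothetical) that assumes it receives a record dict.
    if not isinstance(record, dict):
        raise TypeError(
            f"processor expected a record dict, got {type(record).__name__}"
        )
    record = dict(record)
    record["host"] = "example-host"
    return record


def run_processors(record: Record, processors: List[Processor]) -> Record:
    # Each processor receives whatever the previous processor returned.
    for processor in processors:
        record = processor(record)
    return record
```

With `[add_shard_fixed, add_host]` the record passes through and gains both fields. With `[add_shard_broken, add_host]` the second processor raises a `TypeError`, which is roughly the shape of the "must be an instance of array, string given" failure reported in the log.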