[00:00:05] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161013T0000). Please do the needful.
[00:00:32] Operations, RESTBase-Cassandra, Services, Patch-For-Review: establish new thresholds for cassandra alarms after switching restbase to dtcs - https://phabricator.wikimedia.org/T118976#2711489 (Pchelolo)
[00:00:48] Operations, RESTBase, RESTBase-Cassandra, Services, Patch-For-Review: rename cassandra cluster - https://phabricator.wikimedia.org/T112257#2711495 (Pchelolo)
[00:01:01] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Stirring The Pot, and 2 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2711496 (AndyRussG) @K4-713 Thanks! I'm pretty sure it's somehow related to `MessageCache` or `Revision...
[00:04:05] ^ yannf’s error, mentioned above… the image is 183 megapixels, ‘well’ above $wgMaxImageArea
[00:05:28] Operations, EventBus, Graphite, Services (watching): eventbus should send statsd in batches - https://phabricator.wikimedia.org/T141524#2711510 (Pchelolo)
[00:43:47] Operations, RESTBase, RESTBase-Cassandra, Patch-For-Review, Services (watching): rename cassandra cluster - https://phabricator.wikimedia.org/T112257#2711594 (GWicke)
[01:17:14] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:40:04] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[01:51:06] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:00:44] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:08:04] (PS7) Ori.livneh: Module and role for Recommendation API [puppet] - https://gerrit.wikimedia.org/r/312045 (https://phabricator.wikimedia.org/T116102)
[02:10:55] (CR) Ori.livneh: [C: 2] Module and role for Recommendation API [puppet] - https://gerrit.wikimedia.org/r/312045 (https://phabricator.wikimedia.org/T116102) (owner: Ori.livneh)
[02:16:33] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:25:33] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 08m 24s)
[02:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:54] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:51:50] Operations, MediaWiki-extensions-CentralNotice, Traffic: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#2711739 (BBlack) If we want to go down that kind of road, it would probably be better efficiency-wise to have varnish set simpler request-side head...
[02:54:49] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.22) (duration: 12m 47s)
[02:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:01:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 13 03:01:31 UTC 2016 (duration 6m 42s)
[03:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:41:34] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0]
[04:40:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[04:50:01] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Stirring The Pot, and 2 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2711798 (AndyRussG) Summary of what seem to be the most important data points: - The times I've trigger...
[05:00:35] Operations, Services, User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2564937 (Gilles) Has Electron gone through security review at least?
[05:32:48] Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (Matanya) at $day_job we have a mailbox that receives maint-announce emails and that goes to our ticketing system. You could do the same by changing the current maint-announce email to task@phabricator.wikimedi...
[05:38:48] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2711818 (Gilles)
[05:38:50] Operations, Performance-Team, Thumbor: Thumbor times out on large files sometimes - https://phabricator.wikimedia.org/T147412#2711814 (Gilles) Open>Resolved Only happened once on a 450MB TIFF since we increased the limit. I'm going to go ahead and assume that my theory is correct and that it...
[05:40:25] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#2711820 (Gilles) Where does the code for that live?
[05:41:38] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2711822 (Gilles) "monitoring and alarming" is still unchecked on this task. Is that true? Is there still something to do there? I thought we had paging ala...
[05:45:08] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2711825 (Gilles)
[05:45:11] Operations, Performance-Team, Thumbor: Gifsicle engine: AttributeError: 'Engine' object has no attribute 'exif' - https://phabricator.wikimedia.org/T145504#2711824 (Gilles)
[05:45:48] Operations, Performance-Team, Thumbor: Gifsicle engine: AttributeError: 'Engine' object has no attribute 'exif' - https://phabricator.wikimedia.org/T145504#2632355 (Gilles) Orphaning this task, since it's an upstream low priority bugfix, not a dependency for production deployment.
[05:57:17] Operations, Services, User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2711882 (GWicke) Some leads on firejail support: - https://firejail.wordpress.com/documentation-2/x11-guide/ - https://www.xpra.org/trac/wiki...
[06:02:44] (PS1) Marostegui: db-eqiad.php: Depool db1068 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/315625 (https://phabricator.wikimedia.org/T147305)
[06:13:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[06:14:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/1/3: down - Core: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave]
[06:14:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/1/3: down - Core: cr2-eqiad:xe-4/1/3 (Level3, BDFS2448, 84ms) {#A0010621} [10Gbps wave]
[06:17:40] Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2711901 (Matanya) In addition, one can set a new mail address, (e.g. maint-announce@phabricator.wikimedia.org.) and with herald rules automatically add project maint-announce and mark private, give this address to vendor...
[06:19:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[06:22:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:26:58] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:27:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0
[06:31:17] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:33:37] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[tree],Package[ack-grep]
[06:35:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:37:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:44:18] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:52:43] !log installing ghostscript security updates
[06:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:56:40] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:01:58] (PS6) Giuseppe Lavagetto: Conftool: Create script that checks the state after (de)pooling [puppet] - https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: Mobrovac)
[07:10:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:14:02] !log Dropping hitcounter, _counter memory tables in S5 (dewiki, wikidatawiki) - T132837
[07:14:03] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837
[07:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:25:56] !log Dropping hitcounter, _counter memory tables in S6 (frwiki jawiki ruwiki) on db1050 (master) - T132837
[07:25:57] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837
[07:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:26:54] (CR) Mobrovac: [C: -1] Conftool: Create script that checks the state after (de)pooling (3 comments) [puppet] - https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: Mobrovac)
[07:30:18] self -1 mobrovac ? :P
[07:30:38] haha
[07:31:02] elukey: feeling s&m today :D
[07:32:10] https://en.wikipedia.org/wiki/De_gustibus_non_est_disputandum
[07:32:26] yuvipanda!
[07:33:31] elukey: lol
[07:33:46] haven't heard that one in a while
[07:34:20] ahahahha
[07:52:47] (PS7) Giuseppe Lavagetto: Conftool: Create script that checks the state after (de)pooling [puppet] - https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: Mobrovac)
[07:53:23] <_joe_> mobrovac: please look at this version :P
[07:53:35] <_joe_> mobrovac: I already used it to restart hhvm on a server
[07:53:39] (CR) jenkins-bot: [V: -1] Conftool: Create script that checks the state after (de)pooling [puppet] - https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: Mobrovac)
[07:53:59] <_joe_> rubocop I hate you
[07:54:29] <_joe_> modules/conftool/files/pooler-loop.rb:105:6: C: Prefer $CHILD_STATUS from the stdlib 'English' module over $?
[07:54:51] <_joe_> mobrovac: you can fix those if you want to, I won't do it for now, I have other fishes to fry :P
[07:58:22] (PS1) Giuseppe Lavagetto: role::etcd::common: fix call to backup class to consider cluster_name [puppet] - https://gerrit.wikimedia.org/r/315636
[07:58:30] <_joe_> elukey: ^^
[07:59:57] (CR) Giuseppe Lavagetto: [C: 2] role::etcd::common: fix call to backup class to consider cluster_name [puppet] - https://gerrit.wikimedia.org/r/315636 (owner: Giuseppe Lavagetto)
[08:14:50] thanks _joe_!
[08:15:04] (CR) Jcrespo: [C: -2] mariadb: split role classes into separate files [puppet] - https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[08:21:01] (PS1) Ema: dstat_varnishstat: define 'counters' as class variable [puppet] - https://gerrit.wikimedia.org/r/315643
[08:22:22] (CR) Ema: [C: 2] dstat_varnishstat: define 'counters' as class variable [puppet] - https://gerrit.wikimedia.org/r/315643 (owner: Ema)
[08:27:22] (CR) Hashar: "Looks like puppet needs a "force" parameter to delete directory even if empty :( I have manually rmdir /var/lib/jenkins/init.groovy.d on" [puppet] - https://gerrit.wikimedia.org/r/315563 (owner: Hashar)
[08:28:18] (CR) MarcoAurelio: Stop adding "Category:Uploaded with UploadWizard" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: MarcoAurelio)
[08:31:54] !log updating app server canaries to new hhvm package
[08:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:35:06] !log restarting aqs on aqs1004 to pick up the new nodejs package
[08:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:35:15] (PS2) MarcoAurelio: Stop adding "Category:Uploaded with UploadWizard" [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799)
[08:36:19] (PS3) MarcoAurelio: Stop adding "Category:Uploaded with UploadWizard" [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799)
[08:37:48] (CR) MarcoAurelio: "> Product sign-off, FWIW." [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: MarcoAurelio)
[08:38:41] (CR) Filippo Giunchedi: [C: 2] Upgrade to 0.1.26 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/315507 (owner: Gilles)
[08:52:39] (PS3) Alex Monk: Follow-up Ifa2cc187: Add ShortUrl support on wikimedia.org docroot sites [puppet] - https://gerrit.wikimedia.org/r/311647 (https://phabricator.wikimedia.org/T146014)
[08:53:04] (PS2) Filippo Giunchedi: Use wikimedia GIF engine for Thumbor [puppet] - https://gerrit.wikimedia.org/r/315508 (owner: Gilles)
[08:54:54] (CR) Filippo Giunchedi: [C: 2] Use wikimedia GIF engine for Thumbor [puppet] - https://gerrit.wikimedia.org/r/315508 (owner: Gilles)
[08:59:11] (PS1) Gilles: Log when HTTP status codes from Mediawiki and Thumbor are different [puppet] - https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918)
[09:00:20] (CR) jenkins-bot: [V: -1] Log when HTTP status codes from Mediawiki and Thumbor are different [puppet] - https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: Gilles)
[09:02:58] (PS2) Marostegui: db-eqiad.php: Depool db1068 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/315625 (https://phabricator.wikimedia.org/T147305)
[09:04:35] (PS2) Muehlenhoff: Configure wasat for installation with jessie [puppet] - https://gerrit.wikimedia.org/r/315470
[09:05:39] (CR) Muehlenhoff: [C: 2] Configure wasat for installation with jessie [puppet] - https://gerrit.wikimedia.org/r/315470 (owner: Muehlenhoff)
[09:06:55] (PS2) Gilles: Log when HTTP status codes from Mediawiki and Thumbor are different [puppet] - https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918)
[09:08:28] (PS2) Giuseppe Lavagetto: lvm: add module from puppetlabs. [puppet] - https://gerrit.wikimedia.org/r/315293
[09:09:39] !log reimaging wasat to jessie
[09:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:12:21] (CR) Hashar: "stdout/stderr is captured by systemd and ends up in its journal as well as syslog in daemon.log. Might want to route the log to a specifi" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/315571 (owner: Chad)
[09:14:37] (PS1) Filippo Giunchedi: rsyslog: fully port receiver to jessie [puppet] - https://gerrit.wikimedia.org/r/315649
[09:15:23] !log Stopping MySQL in db2057.codfw.wmnet to use it to clone another server
[09:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:15:50] (CR) Hashar: [C: 1] Gerrit: puppetize log4j.properties [puppet] - https://gerrit.wikimedia.org/r/315571 (owner: Chad)
[09:38:10] <_joe_> uhm bad netsplit going on I'd say
[09:41:38] Looks like mira.codfw.wmnet isn't working for me as a deployment server, it gets stuck when syncing-masters :|
[09:41:59] https://phabricator.wikimedia.org/P4208
[09:42:43] Oh, it is actually doing stuff, but super slow
[09:45:07] !log marostegui@mira Synchronized wmf-config/db-eqiad.php: Depool db1068 for an ALTER table - T147305 (duration: 04m 58s)
[09:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:46:21] wasat.codfw.wmnet host failed to sync, which looks down
[09:46:40] marostegui: it's currently reimaged to jessie, see SAL
[09:47:00] Oh yes, sorry about that. Thanks!
[09:47:18] <_joe_> marostegui: uhm super slow? I hope we didn't hit again some sync bug
[09:49:13] !log reimaging mw1165 to Debian Jessie (MW Jobrunner)
[09:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:54:36] !log Deploying schema change on commonswiki.revision - db1068 - T147305
[09:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:13:47] (PS1) MarcoAurelio: Create a 'templateeditor' user group at en.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/315653 (https://phabricator.wikimedia.org/T148007)
[10:15:18] (CR) MarcoAurelio: Create a 'templateeditor' user group at en.wiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/315653 (https://phabricator.wikimedia.org/T148007) (owner: MarcoAurelio)
[10:17:15] (PS2) MarcoAurelio: Create a 'templateeditor' user group at en.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/315653 (https://phabricator.wikimedia.org/T148007)
[10:23:01] (PS3) MarcoAurelio: Create a 'templateeditor' user group at en.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/315653 (https://phabricator.wikimedia.org/T148007)
[10:31:25] !log Ran (updated) T132839-Workarounds.sh from my home in terbium
[10:31:26] T132839: Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[10:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:31:43] (PS1) Marostegui: Remove db1019 [software] - https://gerrit.wikimedia.org/r/315656 (https://phabricator.wikimedia.org/T146265)
[10:38:06] (CR) Gehel: [C: 2] wdqs - move monitoring of response time to service, not individual hosts [puppet] - https://gerrit.wikimedia.org/r/315651 (https://phabricator.wikimedia.org/T148015) (owner: Gehel)
[10:44:03] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:47:11] RECOVERY - Disk space on scb1002 is OK: DISK OK
[10:49:44] (CR) Marostegui: [C: 2] Remove db1019 [software] - https://gerrit.wikimedia.org/r/315656 (https://phabricator.wikimedia.org/T146265) (owner: Marostegui)
[10:50:12] (PS2) Alexandros Kosiaris: conftool: Remove apertium from sca machines [puppet] - https://gerrit.wikimedia.org/r/313968 (https://phabricator.wikimedia.org/T147288)
[10:50:16] (CR) Alexandros Kosiaris: [C: 2 V: 2] conftool: Remove apertium from sca machines [puppet] - https://gerrit.wikimedia.org/r/313968 (https://phabricator.wikimedia.org/T147288) (owner: Alexandros Kosiaris)
[10:51:15] (Merged) jenkins-bot: Remove db1019 [software] - https://gerrit.wikimedia.org/r/315656 (https://phabricator.wikimedia.org/T146265) (owner: Marostegui)
[10:51:23] o/
[10:53:30] (PS1) Gehel: wdqs - activate monitoring of LVS service [puppet] - https://gerrit.wikimedia.org/r/315658
[10:56:45] (CR) Alexandros Kosiaris: [C: 2] conftool: Remove the apertium service from sca [puppet] - https://gerrit.wikimedia.org/r/313969 (https://phabricator.wikimedia.org/T147288) (owner: Alexandros Kosiaris)
[10:56:49] (PS2) Alexandros Kosiaris: conftool: Remove the apertium service from sca [puppet] - https://gerrit.wikimedia.org/r/313969 (https://phabricator.wikimedia.org/T147288)
[10:56:51] (CR) Alexandros Kosiaris: [V: 2] conftool: Remove the apertium service from sca [puppet] - https://gerrit.wikimedia.org/r/313969 (https://phabricator.wikimedia.org/T147288) (owner: Alexandros Kosiaris)
[10:57:44] (CR) Alexandros Kosiaris: [C: 2] Remove apertium from sca [puppet] - https://gerrit.wikimedia.org/r/313970 (https://phabricator.wikimedia.org/T147288) (owner: Alexandros Kosiaris)
[10:57:48] (PS2) Alexandros Kosiaris: Remove apertium from sca [puppet] - https://gerrit.wikimedia.org/r/313970 (https://phabricator.wikimedia.org/T147288)
[10:57:50] (CR) Alexandros Kosiaris: [V: 2] Remove apertium from sca [puppet] - https://gerrit.wikimedia.org/r/313970 (https://phabricator.wikimedia.org/T147288) (owner: Alexandros Kosiaris)
[11:01:04] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:09:36] (CR) MarcoAurelio: "> Wow, I wanted to propose this in the wiki due to a different issue," [mediawiki-config] - https://gerrit.wikimedia.org/r/239455 (https://phabricator.wikimedia.org/T113096) (owner: Platonides)
[11:11:49] RECOVERY - mediawiki-installation DSH group on mw1164 is OK: OK
[11:13:18] Operations, Analytics, Traffic: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967#2712450 (ema) p:Triage>Normal
[11:14:27] !log mw1165 (MW Jobrunner) back in service after reimage
[11:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:15:51] Operations, Performance-Team, Thumbor: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2712454 (Gilles) Still broken, because of something quite tricky. USE_GIFSICLE_ENGINE has to be off to let the wikimedia plugins run instead of thumbor giving GIF handling...
[11:20:26] Operations, Monitoring: Extract metrics from logs - https://phabricator.wikimedia.org/T147923#2712461 (fgiunchedi) I'll be following up with puppetization for mtail, though this is what I was able to get from wezen syslog central server with mtail as a test: (shown in prometheus format, though the same m...
[11:26:46] (PS1) Ema: Upgrade cp1008 to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/315662 (https://phabricator.wikimedia.org/T131503)
[11:28:29] Operations, Performance-Team, Thumbor: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2712468 (Gilles)
[11:29:54] (PS1) Gilles: Upgrade to 0.1.27 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/315663 (https://phabricator.wikimedia.org/T147919)
[11:30:51] bblack: FYI https://www.globalsign.com/en/status/
[11:32:21] (CR) Ema: [C: 2] Upgrade cp1008 to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/315662 (https://phabricator.wikimedia.org/T131503) (owner: Ema)
[11:33:07] ema nearly done with the goal ;p
[11:33:18] almost there!
[11:49:33] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:49:45] wow
[11:59:49] hoo: ? it's not even loading for me
[12:00:24] bblack: Not surprising
[12:00:55] Their OCSP is going crazy https://twitter.com/globalsign/status/786505261842247680
[12:10:51] we use OCSP stapling
[12:11:06] so it shouldn't affect us unless it remains down for a while
[12:12:48] (PS1) Alexandros Kosiaris: Set trusty as the installer for the sca VMs [puppet] - https://gerrit.wikimedia.org/r/315667
[12:12:49] "OCSP stapling" means that our servers poll GlobalSign's OCSP servers, get and cache the response once, then transmit it to UAs alongside the certificate
[12:12:59] Only saw one user report, so probably fine
[12:13:23] link?
[12:13:32] In #wikimedia-tech
[12:14:45] hrm
[12:14:46] (CR) Alexandros Kosiaris: [C: 2] Set trusty as the installer for the sca VMs [puppet] - https://gerrit.wikimedia.org/r/315667 (owner: Alexandros Kosiaris)
[12:14:47] weird
[12:14:51] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[12:14:59] I wonder why Chrome would poll the OCSP server when a stapled response is present
[12:19:50] is chrome failing on us?
[12:20:31] in any case, we cache 12h OCSP responses from GS, and we refresh them once an hour when everything's working
[12:20:44] so we have ~11h for them to fix any issues with their OCSP responder before it starts to become a direct issue for us
[12:20:46] there is a report from a user using Chrome getting an error
[12:20:54] but it wfm
[12:20:58] so who knows
[12:21:09] did they say what site? we only staple cache-terminated certs
[12:21:10] could be a stapling-stripping corporate for all we know
[12:21:19] not all the one-off direct public services
[12:21:21] 14:24 < pietrodn> I get this error on all Wikimedia domains. GlobalSign certificate is marked as expired on Chrome, Safari on macOS Sierra. Works with Firefox though.
[12:23:41] looking at one of our text caches to check, the cached ocsp staple data we have there was last successfully fetched at 11:23 (about an hour ago), and they're actually giving us a 4-day validity on the response:
[12:23:45] This Update: Oct 13 11:23:01 2016 GMT
[12:23:48] Next Update: Oct 17 11:23:01 2016 GMT
[12:24:03] (that's a change from last I looked, they used to only give us 12 hours in each response)
[12:24:40] we haven't logged any failures so far, so this outage doesn't seem to be affecting our OCSP responses at all, even to our caches
[12:24:53] (or I'm misunderstanding the nature of the problem)
[12:25:01] (CR) Gehel: [C: 2] wdqs - activate monitoring of LVS service [puppet] - https://gerrit.wikimedia.org/r/315658 (owner: Gehel)
[12:25:09] (PS2) Gehel: wdqs - activate monitoring of LVS service [puppet] - https://gerrit.wikimedia.org/r/315658
[12:25:17] that same server just did its next hourly update, also successful just now, new window:
[12:25:20] This Update: Oct 13 12:23:02 2016 GMT
[12:25:23] Next Update: Oct 17 12:23:02 2016 GMT
[12:25:49] maybe this is only for certain globalsign intermediates, and thus maybe e.g. only affects EV-cert customers?
[12:26:26] could be
[12:26:47] although the user report sounds a little too correlated to be a coincidence
[12:26:54] they used to use cloudflare for their OCSP
[12:27:01] so maybe it's some regional issue
[12:27:39] true
[12:27:40] ocsp2.globalsign.com is an alias for cdn.globalsigncdn.com.
[12:27:51] still cloudflare
[12:27:55] we're pulling it directly in esams too and no issue there
[12:28:19] I wonder why the reports talk so much about macs specifically and CRLs
[12:28:23] does it pull it via squid?
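The validity windows quoted above can be checked mechanically. What follows is a minimal Python sketch, not code from the puppet repo: it assumes the "This Update" / "Next Update" timestamp layout exactly as pasted in the channel, and the helper names parse_update_window and remaining_validity are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Timestamp layout as pasted in the log, e.g. "Oct 13 11:23:01 2016 GMT".
_STAMP = "%b %d %H:%M:%S %Y GMT"


def parse_update_window(text):
    """Return (this_update, next_update) as timezone-aware UTC datetimes."""
    found = {}
    for label in ("This Update", "Next Update"):
        match = re.search(r"%s:\s*(.+? GMT)" % label, text)
        if match is None:
            raise ValueError("missing %r field" % label)
        # Collapse repeated spaces before parsing.
        stamp = " ".join(match.group(1).split())
        parsed = datetime.strptime(stamp, _STAMP)
        found[label] = parsed.replace(tzinfo=timezone.utc)
    return found["This Update"], found["Next Update"]


def remaining_validity(text, now):
    """Time left until Next Update (negative once the response has expired)."""
    _, next_update = parse_update_window(text)
    return next_update - now


# The first window quoted in the channel at 12:23.
sample = """This Update: Oct 13 11:23:01 2016 GMT
Next Update: Oct 17 11:23:01 2016 GMT"""
this_update, next_update = parse_update_window(sample)
window = next_update - this_update
```

With the first pasted window, `window` comes out as four days, matching the "4-day validity" remark; a freshness monitor could alert when `remaining_validity` drops below a chosen margin.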
[12:28:38] it sounds like more than just an OCSP responder issue
[12:28:42] not all UAs check OCSP by default
[12:28:50] firefox didn't last I checked
[12:28:56] unless you enabled an option in about:config
[12:29:01] FF is the one that used to always check OCSP
[12:29:21] (as in, when we have in the past screwed up our stapled responses, FF users were the only ones reporting the failure)
[12:30:14] IIRC firefox does OCSP by default, but if the request fails it continues and doesn't error out
[12:30:27] it only errors out if it *successfully* gets a revoked OCSP response
[12:30:37] which sounds like what's happening here, somehow
[12:30:57] maybe I should double-check we don't have any "successful" cached ocsp data which actually says "revoked" :)
[12:31:03] there was an about:config option to make it require a valid OCSP response before proceeding
[12:31:06] I used to have that enabled
[12:31:17] but it failed on every captive portal hotspot out there, essentially
[12:31:45] so I turned that off again
[12:32:22] none of our current staples for the unified cert show revoked
[12:32:37] Shouldn't we have an icinga check for that?
[12:32:38] they're all: Response verify OK - OCSP Response Status: successful (0x0)
[12:33:05] we do, indiretly
[12:33:08] yeah, it's a good point, I don't believe we check OCSP (or CRL) in our monitoring right now
[12:33:09] *indirectly
[12:33:11] we do?
[12:33:24] yeah we fixed that as part of a cronspam reduction I think? I'll have to look
[12:33:39] but I think it ultimately relies on the cron itself failing, so it's the cronjob's internal checks
[12:34:02] it looks like Net::SSLeay supports OCSP
[12:34:08] so check_ssl could check for that too
[12:34:22] http://search.cpan.org/~mikem/Net-SSLeay-1.78/lib/Net/SSLeay.pod#Certificate_verification_and_Online_Status_Revocation_Protocol_(OCSP)
[12:34:27] it's not very straightforward
[12:34:50] the icinga check looks at "Freshness of OCSP Stapling files", and warns if they're getting old on mtime
[12:35:03] and the updater refuses to update the file if any of various problems appear in the data or the transfer fails, etc
[12:35:10] the question is just how good the checks are in the fetcher
[12:35:23] I don't know that it checks the Response Status field
[12:35:55] (it checks that the response is legitimate and successful and that the time window is ok, etc... but I think it assumes a successful response wouldn't say "revoked")
[12:36:16] if not re.search('^Response verify OK$', ocsp_err, re.M):
[12:36:16] raise Exception("Did not find verification OK in stderr:\n%s" %
[12:36:43] you wrote that :P
[12:37:07] yeah but that's just verification that we got a legitimate/correct response. a legitimate verified response might still say "revoked" in its data
[12:37:20] that's what probably needs adding
[12:44:51] (PS1) BBlack: update-ocsp: check response status [puppet] - https://gerrit.wikimedia.org/r/315670 (https://phabricator.wikimedia.org/T93927)
[12:46:14] (PS2) BBlack: update-ocsp: check response status [puppet] - https://gerrit.wikimedia.org/r/315670 (https://phabricator.wikimedia.org/T93927)
[12:46:17] pep8 and line lengths :P
[12:48:01] I guess our pep8 only runs based on filename extension
[12:48:18] <_joe_> bblack: yes
[12:49:00] <_joe_> bblack: and it was decided we'll limit lines to 120 characters (I asked for 160, or no limit)
[12:50:18] but but, how will I cope with all the ugly wrapped lines when I'm editing our puppet repo on my DEC VT100
[12:50:22] :q
[12:52:01] (CR) BBlack: [C: 2] update-ocsp: check response status [puppet] - https://gerrit.wikimedia.org/r/315670 (https://phabricator.wikimedia.org/T93927) (owner: BBlack)
[12:53:11] Operations, DBA, MediaWiki-General-or-Unknown, Patch-For-Review, WMF-deploy-2016-10-11_(1.28.0-wmf.22): img_metadata queries for PDF files saturates s4 slaves - https://phabricator.wikimedia.org/T147296#2712641 (jcrespo) As this is being worked at mediawiki level, I am going to move us into m...
[12:53:52] !log Dropping hitcounter, _counter memory tables in S6 (frwiki jawiki ruwiki) - T132837
[12:53:53] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837
[12:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:58:44] jouncebot: next
[12:58:44] In 0 hour(s) and 1 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161013T1300)
[13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161013T1300).
[13:00:04] mobrovac and hashar: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:01:47] (PS2) Hashar: Adding language name configuration for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/312944 (https://phabricator.wikimedia.org/T146707) (owner: Jon Harald Søby)
[13:01:50] doing the wikidata change
[13:01:58] (CR) Hashar: [C: 2] Adding language name configuration for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/312944 (https://phabricator.wikimedia.org/T146707) (owner: Jon Harald Søby)
[13:02:21] hashar: who is swatting?
[13:02:26] (Merged) jenkins-bot: Adding language name configuration for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/312944 (https://phabricator.wikimedia.org/T146707) (owner: Jon Harald Søby)
[13:02:27] us :]
[13:02:29] haha
[13:03:02] let's go?
[13:03:11] yeah doing the wikidata change already
[13:03:51] you skipped me!
[13:03:52] :P
[13:04:01] the wikidata one is straightforward
[13:04:11] you can rebase / +2 yours already
[13:04:21] should be possible to verify the languages on test.wikidata
[13:07:18] i looked on mw1099
[13:07:24] aude: I see them both in my Special:Preferences
[13:07:58] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:08:16] great
[13:08:22] pushing
[13:09:33] mobrovac: your turn :]
[13:09:49] hashar: waiting on jenkins, I C+2'ed already
[13:09:56] !log hashar@mira Synchronized wmf-config/InitialiseSettings.php: Adding language name configuration for Wikidata T146707 (duration: 00m 53s)
[13:09:57] T146707: Add Lule Sami and Pite Sami as supported languages in Wikidata - https://phabricator.wikimedia.org/T146707
[13:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:10:35] mobrovac: I will have to leave just after
[13:10:39] have some errands to conduct
[13:10:55] kk
[13:10:56] the other change is fully deployed
[13:11:26] bah CI failed on your patch for some reason
[13:11:30] it did not receive the event
[13:11:41] ah no
[13:11:43] it is running
[13:11:43] sorry
[13:11:54] :)
[13:12:32] (CR) Bartosz Dziewoński: "I've not tried what "'autoAdd' => [ 'categories' => '', ]" would do, but it could result in UW adding "[[Category:]]" to pages, so let's n" [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: MarcoAurelio)
[13:13:52] go jenkins, gooooo
[13:14:18] yeah the test suite is borked
[13:14:19] PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[gdisk]
[13:14:30] I am sure there are a lot of low hanging fruits to speed it up
[13:14:37] cmjohnson1: Let me know when you are around and I can switchoff db1053
[13:15:26] _joe_: if you have some spare time for the zuul hiera refactor, that would be nice ( https://gerrit.wikimedia.org/r/#/c/308778/ )
[13:15:33] marostegui: give me about a hour or so..thx
[13:15:41] cmjohnson1: Sure thing - thanks!
[13:15:47] _joe_: I think I addressed all your suggestion, ran it through the puppet compiler but there is still a gotcha I fail to understand [13:16:04] _joe_: no hurries, I am out in a few [13:16:04] <_joe_> hashar: not right now, but I'll look [13:16:22] _joe_: thanks ! :] [13:16:41] hashar: ok, merged, can you sync? [13:16:59] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:17:01] includes/api/ApiPurge.php [13:17:13] on it [13:18:25] syncing [13:19:11] !log hashar@mira Synchronized php-1.28.0-wmf.22/includes/api/ApiPurge.php: ApiPurge: Set the triggering user for the LinksUpdate T147516 T147977 (duration: 00m 52s) [13:19:13] T147977: [BUG] refresh links job failing with: Call to a member function getName() on a non-object (null) - https://phabricator.wikimedia.org/T147977 [13:19:14] mobrovac: done :] [13:19:14] T147516: Create a page-properties-change event - https://phabricator.wikimedia.org/T147516 [13:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:19:36] hashar: kk, checking [13:19:39] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[nodejs-legacy],Package[nodejs] [13:19:53] on the OCSP front, their status page now loads, and it says: [13:20:03] We are currently experiencing a known issue which is causing certificate revocation/error messages to be displayed within some of our certificates. We ask all customers to please follow the instructions in this support article to clear their cache. 
[13:20:20] "customers" being end users hitting sites with GS certs heh [13:20:33] the support link is: https://support.globalsign.com/customer/portal/articles/1353318-view-and-or-delete-crl-ocsp-cache [13:20:33] !log rolling restart of restbase in eqiad to pick up new nodejs [13:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:21:01] but I suspect the "some of our certificates" doesn't apply to the intermediate we use, as we're still seeing no OCSP fetch failures on our end [13:22:32] hashar: kk, all good, we're done :) [13:22:35] hashar: thnx [13:22:39] awesome! [13:22:57] !log European SWAT completed [13:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:24:03] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: Maps - error when doing initial tiles generation: "Error: could not create converter for SQL_ASCII"" - https://phabricator.wikimedia.org/T148031#2712688 (10Gehel) [13:24:11] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.27 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/315663 (https://phabricator.wikimedia.org/T147919) (owner: 10Gilles) [13:32:11] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:40:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 743 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3139256 keys - replication_delay is 743 [13:41:44] 06Operations, 10Traffic, 10netops: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2712727 (10BBlack) Proposed subnet mapping: * eqiad ** high-traffic1 (lvs1001 + lvs1004) *** 208.80.154.224/28 (224-239) *** 2620:0:861:ed1a::0:0/111 (::0:0 - ::1:ffff) ** high-traffic2... 
[13:43:58] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:48:35] (03CR) 10Andrew Bogott: [C: 032] puppet: Add a message announcing deprecation to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/315564 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [13:48:39] (03PS3) 10Andrew Bogott: puppet: Add a message announcing deprecation to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/315564 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [13:54:15] 06Operations, 10Traffic, 10netops: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2712836 (10faidon) I could find out, but since you've already done the investigation: do we need to renumber or relocate any IPs for this scheme to work? If so, which? [13:58:03] 06Operations, 10Traffic, 10netops: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2712848 (10BBlack) Audit result: All datacenters already obey the mapping above, except for 3x exceptions in eqiad: * ocg.svc.eqiad.wmnet - currently in high-traffic2, should be in low-... 
[14:07:02] (03CR) 10Gilles: [C: 04-1] "Error: Could not start Service[cgrulesengd]: Execution of '/bin/systemctl start cgrulesengd' returned 1: Job for cgrulesengd.service faile" [puppet] - 10https://gerrit.wikimedia.org/r/315248 (owner: 10Gilles) [14:08:43] (03PS1) 10Gehel: kibana - move to an LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315675 (https://phabricator.wikimedia.org/T132458) [14:08:45] (03PS1) 10Gehel: kibana - activate icinga check on new LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458) [14:08:47] (03PS1) 10Gehel: kibana - configure varnish to use new LVS service as backend [puppet] - 10https://gerrit.wikimedia.org/r/315677 (https://phabricator.wikimedia.org/T132458) [14:11:10] (03CR) 10Gehel: "I'm not entirely sure about the naming of the cluster / service here. I named both "kibana", but it might make sense to name "cluster=logs" [puppet] - 10https://gerrit.wikimedia.org/r/315675 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [14:18:37] (03CR) 10Gilles: "Not sure how to go forward, it complains about /etc/cgconfig.conf being missing, but when I use the recommended way to generate it based o" [puppet] - 10https://gerrit.wikimedia.org/r/315248 (owner: 10Gilles) [14:23:20] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2636188 (10Gilles) a:05fgiunchedi>03Gilles [14:25:25] (03CR) 10Gilles: [C: 04-1] "Waiting for Filippo's change creating a proper puppet module for mtail" [puppet] - 10https://gerrit.wikimedia.org/r/315272 (owner: 10Gilles) [14:28:27] marostegui: I am ready whenever you are [14:28:33] (03CR) 10Gilles: [C: 04-1] "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: File[/srv/thumbor] is already declared i" [puppet] - 10https://gerrit.wikimedia.org/r/315234 (owner: 10Gilles) [14:28:39] cmjohnson1: 
Sure, going to stop mysql [14:30:34] (03CR) 10Dzahn: "could you give a reason for the -2?" [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [14:31:32] (03PS4) 10Gilles: Point to a folder firejailed thumbor can actually write to [puppet] - 10https://gerrit.wikimedia.org/r/315234 [14:32:02] !log Shutting down MySQL in db1053, it is going to be moved to another rack - T147774 [14:32:03] T147774: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774 [14:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:24] cmjohnson1: The server is now off [14:36:40] (03PS5) 10Gilles: Point to a folder firejailed thumbor can actually write to [puppet] - 10https://gerrit.wikimedia.org/r/315234 [14:40:21] 06Operations, 10ops-eqiad, 10DBA: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774#2712933 (10Marostegui) Server downtimed MySQL stopped Server powered off [14:41:10] marostegui: okay...moving now [14:41:20] cmjohnson1: excellent - thank you [14:45:29] !log installing nspr security updates [14:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:00] (03CR) 10jenkins-bot: [V: 04-1] Maps - cleanup postgres user creation [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel) [14:47:42] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:48:38] (03PS3) 10Gehel: Maps - cleanup postgres user creation [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) [14:49:37] (03CR) 10Gilles: [C: 04-1] "The 404 log file appears, but it remains empty when I try to throw 404s at thumbor:" [puppet] - 10https://gerrit.wikimedia.org/r/315234 (owner: 10Gilles) [14:50:23] (03PS1) 10Marostegui: db-eqiad.php: Repool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315683 (https://phabricator.wikimedia.org/T147305) [14:54:36] (03Abandoned) 10Marostegui: db-eqiad.php: Repool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315683 (https://phabricator.wikimedia.org/T147305) (owner: 10Marostegui) [14:57:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315685 [15:01:09] (03CR) 10Jcrespo: "Too large to explain. Please contact the owner before doing large refactoring. You will lose a lot of time otherwise. 
Pending changes will" [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [15:01:56] (03CR) 10Gilles: "Probably actually works, it's just that beta thumbor isn't receiving beta swift traffic" [puppet] - 10https://gerrit.wikimedia.org/r/315234 (owner: 10Gilles) [15:05:36] can we update the topic here with https://twitter.com/globalsign/status/786505261842247680 [15:05:41] needs an op [15:07:43] greg-g, robh ^ [15:11:02] (03PS1) 10Cmjohnson: Moving db1053 to row A, updating dns entries(T147774), at same time removing dns entries for decom host db1010 (T129395) [dns] - 10https://gerrit.wikimedia.org/r/315690 [15:11:41] (03CR) 10Cmjohnson: [C: 032] Moving db1053 to row A, updating dns entries(T147774), at same time removing dns entries for decom host db1010 (T129395) [dns] - 10https://gerrit.wikimedia.org/r/315690 (owner: 10Cmjohnson) [15:11:42] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:12:49] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315685 [15:16:36] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315685 (owner: 10Marostegui) [15:17:02] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315685 (owner: 10Marostegui) [15:18:41] 06Operations, 07Puppet, 15User-Joe: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#2713061 (10yuvipanda) >>! In T119042#2521941, @faidon wrote: > Why can't we just move all the roles from manifests/roles into modules/roles i... 
[15:18:47] !log marostegui@mira Synchronized wmf-config/db-eqiad.php: Repool db1068 after it was out for an ALTER table - T147305 (duration: 00m 58s) [15:18:49] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [15:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:32] (03CR) 10Yuvipanda: "https://phabricator.wikimedia.org/T119042#2713061 and rest of that ticket for more context." [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [15:19:47] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774#2713063 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson @Marostegui db1053 has been moved to A2 DNS updated Switch Cfg updated Racktables updated [15:19:50] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2713066 (10yuvipanda) p:05Low>03Normal [15:20:18] Thank you cmjohnson1 - you will power on the server for me? 
[15:20:50] yes, it's booting now [15:21:33] Awesome, I will update our db-xx files with the new IP as per you commit: 10.64.0.87 [15:23:56] (03PS1) 10Marostegui: db1053 update IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315691 (https://phabricator.wikimedia.org/T147774) [15:24:21] (03PS5) 10Dzahn: mariadb: split role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) [15:24:35] (03CR) 10jenkins-bot: [V: 04-1] mariadb: split role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [15:26:06] !log T133395: RESTBase: Altering keyspace local_group_wikimedia_T_parsoid_html.data to enable time-window compaction [15:26:07] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [15:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:43] So none of the WM sites are loading for me/ [15:28:41] CP678|Laptop: are you using MacOS Sierra? [15:28:46] Yes [15:29:07] CP678|Laptop: and can you send a screenshot of details? whatever it shows on the error page, or more details from clicking on the red x or some details link, etc? [15:29:13] Have been using it since July [15:29:13] (03CR) 10Jcrespo: "And in 1 year I was not even once referred to that ticket. My -2 still stands." [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [15:29:15] CP678|Laptop: (also, you can probably workaround for now by using Firefox) [15:29:48] CP678|Laptop: we're still investigating, but there seems to be a problem created by our certificate provider, which incidentally only affects Safari+Chrome on Sierra... 
[15:29:55] It just says it can't establish a secure connection [15:30:00] There is no certificate error anymore, it just says that yeah ^ [15:30:04] (03CR) 10Jcrespo: "In fact, this patch goes against Faidon's suggestion, which will solve most issues." [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [15:30:56] My iPhone is doing that intermittently too. [15:31:05] Only for wikis though [15:31:17] !log installing libdbd-mysql-perl security updates [15:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:30] CP678|Laptop: the iphone is iOS10 I assume, which is roughly equivalent to Sierra [15:32:36] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2713121 (10Cmjohnson) [15:32:38] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission db1010 - https://phabricator.wikimedia.org/T129395#2713119 (10Cmjohnson) 05Open>03Resolved [15:32:42] 10.1 [15:32:47] yeah [15:33:09] I don't have safari in front of me to give you specifics, but there's probably a way to show a more-detail error, which could be helpful [15:33:15] jynus: did you find what you needed for UA? [15:33:18] (03CR) 10Jcrespo: [C: 031] db1053 update IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315691 (https://phabricator.wikimedia.org/T147774) (owner: 10Marostegui) [15:33:29] nuria, no :-( [15:33:41] (03PS2) 10Marostegui: db1053 update IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315691 (https://phabricator.wikimedia.org/T147774) [15:33:52] jynus: what do you need exactly? [15:33:57] to [15:33:58] to [15:33:58] nuria: o/ :) [15:34:06] marostegui: hola amijoooo!!!! [15:34:09] to [15:34:12] WTF?? [15:34:13] nuria, there are some ios and mac users having issues [15:34:17] Failed to load resource: An SSL error has occurred and a secure connection to the server cannot be made. 
[15:34:19] with accessing the site [15:34:21] jynus would this https://phabricator.wikimedia.org/D413 work for fulltext [15:34:22] innodb [15:34:22] There we go. [15:34:24] ? [15:34:25] jynus: aham [15:34:29] We have it live on phab-01 [15:34:42] nuria, it would be nice to evaulate the impact of how widespread it is [15:34:48] bblack: Failed to load resource: An SSL error has occurred and a secure connection to the server cannot be made. [15:34:49] (03CR) 10Marostegui: [C: 032] db1053 update IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315691 (https://phabricator.wikimedia.org/T147774) (owner: 10Marostegui) [15:34:51] (03CR) 10Gilles: "Confirmed to work on beta:" [puppet] - 10https://gerrit.wikimedia.org/r/315234 (owner: 10Gilles) [15:34:51] it has been happening only in the last hours [15:34:55] jynus: as in lost requests compared to other days? [15:35:07] on the full aggregates we do not see significative changes [15:35:15] (03Merged) 10jenkins-bot: db1053 update IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315691 (https://phabricator.wikimedia.org/T147774) (owner: 10Marostegui) [15:35:19] so I was hoping to see something on those user agents [15:35:28] jynus: is this apps ? or web? [15:35:33] bblack: I don't think it can get more specific than that/ [15:35:45] At least I don't know how to extract more details [15:35:46] a drop compared to yesteday, or a drop compared to the previous hour [15:36:00] CP678|Laptop: if you click the (broken) lock icon there might be more info [15:36:16] nuria, probably all, but we have only see it on safari or chrome on osx sierra, I think [15:36:27] What broken lock? There is no broken lock [15:36:31] jynus: ok understood [15:36:49] CP678|Laptop in the address bar [15:36:49] CP678|Laptop: in your address bar? 
there's probably /something/ left of the web addres [15:36:52] basically, if you can take some time to help us here, it is a really nasty problem for users [15:36:52] valhallasw`cloud: Safari doesn't have broken lock like Chrome [15:36:55] cmjohnson1: I am still unable to ping db1053 with its new IP [15:36:57] <_joe_> no [15:37:00] <_joe_> safari has nothing [15:37:01] valhallasw`cloud: nope [15:37:08] <_joe_> I just tried to get any info [15:37:09] !log marostegui@mira Synchronized wmf-config/db-eqiad.php: wmf-config/db-codfw.php db1053 got moved to another rack so updating its IP - T147774 (duration: 00m 50s) [15:37:10] T147774: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774 [15:37:14] <_joe_> from safari on sierra [15:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:16] <_joe_> no luck [15:37:30] jynus: is there a ticket about this? [15:37:50] safari looks like this: https://www.dropbox.com/s/ykxpm7pp3ek1hy5/Screenshot%202016-10-13%2008.36.12.png?dl=0 [15:37:51] marostegui: did you reinstall? [15:37:58] cmjohnson1: No [15:38:07] nuria i can create a ticket if you need one [15:38:25] are you planning on re-installing? [15:38:34] cmjohnson1: Not at this point (certainly not today no) :) [15:38:55] Zppix: yes please [15:39:01] nuria one moment [15:39:04] Zppix: a aticket that will describe problem [15:39:14] Zppix: will help [15:40:20] marostegui if not we have to update to /etc/network/interfaces on the server...right now it still has old ip. [15:40:34] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2713144 (10Gilles) [15:40:36] cmjohnson1: Could you do that for me? 
[15:40:37] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2713143 (10Gilles) 05Open>03Resolved [15:40:38] Zppix: hwo urgen is this? [15:40:44] 06Operations: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (10Zppix) [15:40:53] Zppix: can we get numbers in 2 hours? cc jynus [15:41:11] nuria which numbers? [15:41:20] Zppix: to quantify request dropp [15:41:56] nuria one moment [15:42:01] jynus: we will work in getting some numbers [15:42:09] nuria, thanks [15:42:11] bblack: https://dl.dropboxusercontent.com/u/1387996/Screen%20Shot%202016-10-13%20at%209.08.53%20PM.png and https://dl.dropboxusercontent.com/u/1387996/Screen%20Shot%202016-10-13%20at%209.09.57%20PM.png [15:42:15] Now [15:42:27] we just need requests, nothing else to evaulate how widespread [15:42:28] 06Operations: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713160 (10Zppix) [15:42:29] <_joe_> kart_: sierra? [15:42:31] (one is for enwiki, guwiki) [15:42:33] _joe_: yep [15:42:37] <_joe_> kart_: if so, it's known [15:42:53] yeah, I hope these will be helpful in anyway. [15:42:54] kart_: can you click on the intermediate in that first screenshot and see what it says there? [15:43:05] the G2 cert in the middle of the chain [15:43:19] 06Operations, 06Analytics-Kanban: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713164 (10Nuria) [15:43:19] How many users are in this channel right now unable to access WMF sites due to certificate errors? [15:43:40] Zppix: we've been tracking and debugging this for a while now, just not in phab [15:43:45] 06Operations, 06Analytics-Kanban: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (10jcrespo) The workaround for now is to use Firefox, as it has its own TLS stack different from the OS one. [15:43:52] Zppix is this wikipedia? 
[15:43:54] so far as we know, it's only affecting Safari+Chrome on MacOS Sierra [15:43:54] It works for me [15:43:55] 06Operations, 06Analytics-Kanban: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713167 (10jcrespo) p:05Triage>03Unbreak! [15:44:08] and the only known workaround is: use Firefox [15:44:11] paladox only macOS sierra users on chrome+safari [15:44:27] 06Operations, 06Analytics-Kanban, 10Traffic: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713169 (10jcrespo) [15:44:27] Oh [15:44:33] Works on my iphone [15:44:49] I would have thought safari is a shared core with ios [15:44:53] some iOS10 may be intermittently affected as well, we've seen one report here [15:44:59] Oh [15:45:02] IOS 10 works for me [15:45:03] wouldn't expect iOS9 [15:45:10] But im running iOS 10.1 [15:45:44] I get the little lock icon on the web url bit, no red colors just normal telling you, you have https. [15:46:13] bblack: on Safari if you got the popup you can say continue [15:46:14] let me try my ios 8.1 phone and see [15:46:18] 06Operations, 06Performance-Team, 10Thumbor: Extracted ICC profile don't get cleaned up - https://phabricator.wikimedia.org/T147921#2713172 (10Gilles) 05Open>03Resolved The temp folders are now squeaky clean, only containing files being currently processed: ``` gilles@thumbor1001:/srv/thumbor/tmp$ sudo... [15:46:20] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2713174 (10Gilles) [15:47:26] 06Operations, 06Analytics-Kanban, 10Traffic: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713178 (10Zppix) Doesn't appear to affect the iOS 8.1 app for Wikipedia. [15:47:51] (03CR) 10Luke081515: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315653 (https://phabricator.wikimedia.org/T148007) (owner: 10MarcoAurelio) [15:47:52] bblack: sorry late. 
G2 shows this: https://dl.dropboxusercontent.com/u/1387996/Screen%20Shot%202016-10-13%20at%209.15.35%20PM.png [15:47:58] bblack it affects Microsoft Edge [15:47:59] Too [15:48:08] But i carn't reproduce it in IE, or chrome [15:48:13] Which is strange [15:48:23] iOS 8.1 safari is fine for me [15:48:44] paladox OS version please? [15:48:53] (03CR) 10Luke081515: [C: 031] "ok per Task discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313601 (https://phabricator.wikimedia.org/T147063) (owner: 10MarcoAurelio) [15:48:57] Zppix windows 10 [15:48:58] fyi, i'm running macOS sierra and can repro the chrome & safari issues (and also can confirm that firefox works fine). let me know if you need me to test or repro anything [15:49:05] Well windows 10 build 14342 [15:49:08] an insider build [15:49:14] 06Operations, 06Analytics-Kanban, 10Traffic: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713181 (10jcrespo) These has been some of the updates we had recently: > We don't yet understand the full scope or specifics of either the > underlying issue GlobalSign is having, or any impa... [15:49:19] 06Operations, 06Analytics-Kanban, 10Traffic: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713182 (10Zppix) [15:50:11] mdholloway can you get any details in console or anything we need any details we can get our hands on [15:50:16] (03CR) 10Luke081515: [C: 031] Enable Extension:ShortURL on bd.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014) (owner: 10MarcoAurelio) [15:50:22] paladox: for Win10, can you try clearing the CRL/OCSP caches? 
(it hasn't worked for mac users that we've seen) [15:50:33] https://support.globalsign.com/customer/portal/articles/1353318-view-and-or-delete-crl-ocsp-cache [15:50:34] Ok i will try that [15:50:55] Zppix: sure, i'll start digging [15:51:27] bblack should i click the clear cache button in edge [15:51:36] paladox: it's not that cache [15:51:47] Oh [15:51:49] paladox: there's a link above from globalsign about clearing the CRL/OCSP cache on the commandline [15:51:55] Oh [15:52:07] certutil -urlcache * delete [15:52:08] 06Operations, 06Analytics-Kanban, 10Traffic: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713203 (10Zppix) [15:52:12] Thanks [15:52:49] Oh i have visted alot of sites LOL [15:53:30] bblack CertUtil: -URLCache command FAILED: 0x80070103 (WIN32/HTTP: 259 ERROR_NO_MORE_ITEMS) [15:53:31] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission db1010 - https://phabricator.wikimedia.org/T129395#2713206 (10jcrespo) Sorry, I think you do the vlan yourself, I copied from a codfw ticket. Next tickets will strictly follow the documented procedure. [15:53:40] WinHttp Cache entries deleted: 1047 [15:54:50] does the site load in Edge now? [15:54:54] Nope [15:55:00] It shows the certificate error [15:55:04] bblack ^^ [15:55:27] 06Operations, 06Analytics-Kanban, 10Traffic: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713217 (10Zppix) Clearing CertUlti on edge doesn't fix the issue [15:55:41] paladox: Edge on Win10, but not IE, is affected for you, right? [15:55:46] Yep [15:55:49] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774#2713220 (10Marostegui) /etc/network/interfaces changed to reflect the new IP. All good - thanks Chris. [15:56:02] bblack edge is totally blocking wikimedia sites [15:56:08] ok [15:56:10] thanks! 
[15:56:21] We recommend that you close this web page and do not continue to this website. [15:56:30] Go to my homepage instead [15:56:42] can you provide screenshots by chance paladox? [15:56:47] Yep [15:56:49] 06Operations, 06Analytics-Kanban, 10Traffic: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713223 (10ema) [15:57:43] Zppix bblack https://phabricator.wikimedia.org/F4600331 [15:57:47] (03CR) 10Luke081515: [C: 031] "Looks ok for me, but someone other may take a look." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306443 (https://phabricator.wikimedia.org/T143789) (owner: 10MarcoAurelio) [15:57:52] chrome reports NET::ERR_CERT_REVOKED. likewise, safari is reporting that the GS intermediate cert is revoked. [15:58:19] mdholloway: on MacOS Sierra, right? [15:58:21] i can't view the cert in chrome for some reason. [15:58:22] yep [15:58:27] 06Operations, 06Analytics-Kanban, 10Traffic: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713243 (10Zppix) [15:59:22] 06Operations, 06Analytics-Kanban, 10Traffic: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (10Zppix) Edge is completely blocking access to WMF sites as shown in screenshot number 2 in the task description [15:59:47] 06Operations, 06Analytics-Kanban, 10Traffic: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713248 (10BBlack) [15:59:54] (03PS1) 10Joal: Add dataCubes description to pivot config template [puppet] - 10https://gerrit.wikimedia.org/r/315695 [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161013T1600). Please do the needful. [16:00:16] paladox: does Chrome work on your Win10? 
I'm assuming FF does, but wouldn't hurt to confirm [16:00:21] bblack yep [16:00:35] I havent tryed ff yet since i doint have that installed [16:00:37] but chrome works [16:00:53] firefox works on win 7 bblack so im assuming it would work on win 10 [16:00:58] (03CR) 10Luke081515: [C: 031] Rename 'technican' and 'technician' to 'interface-editor' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308281 (https://phabricator.wikimedia.org/T144638) (owner: 10MarcoAurelio) [16:01:30] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [16:01:38] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [16:02:11] 06Operations, 06Analytics-Kanban, 10Traffic: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713281 (10Zppix) [16:02:23] 06Operations, 06Analytics-Kanban, 10Traffic: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (10Zppix) [16:03:39] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:04:14] 06Operations, 10ops-esams, 10DNS, 10Traffic, 10netops: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2713300 (10BBlack) 05Resolved>03Open Down again! Assuming for the moment it's ethernet again... 
[16:05:57] (03CR) 10Luke081515: [C: 04-1] Rename 'autopatrol' to 'autopatrolled' on fawiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) (owner: 10MarcoAurelio) [16:06:03] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713305 (10Zppix) [16:06:05] Ugh what to do :( [16:07:09] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 84.23 ms [16:07:25] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 84.79 ms [16:07:46] sjoerddebruin try to figure out what we can do about t148045 [16:08:13] Not sure what can help, and I hate using Firefox for it's font rendering. :) [16:08:43] basically any additonal info that can be provided [16:08:46] will be nice [16:09:44] bblack Zppix i now get the error on IE [16:10:16] Great... ok [16:10:25] It seems to be intermitten with IE [16:10:30] Since it works again [16:10:34] hmmm ok [16:10:48] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713344 (10Zppix) [16:10:51] Zppix bblack https://phabricator.wikimedia.org/F4600369 [16:11:12] (03CR) 10Chad: "This is exactly what we do now, as I mentioned in the commit summary...just trying to get it puppetized so we can make further adjustments" [puppet] - 10https://gerrit.wikimedia.org/r/315571 (owner: 10Chad) [16:12:25] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (10Zppix) [16:12:40] ok thanks paladox [16:12:55] Your welcome [16:13:11] possibly of interest: earlier today (9-10 a.m. US eastern time) for me only *.wikimedia.org and wikimediafoundation.org sites were affected by the cert problem, but *.wikipedia.org weren't. now the wikipedias are also affected. 
(this is with chrome/safari on sierra) [16:13:22] 06Operations, 06Labs, 10Striker, 07LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#2713357 (10MoritzMuehlenhoff) [16:14:01] (03CR) 10Elukey: "Added some comments, let me know if you like them or not!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/315695 (owner: 10Joal) [16:14:38] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:18:56] ok mdholloway [16:20:13] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713388 (10Zppix) An user reports on IRC: earlier today (9-10 a.m. US eastern time) only *.wikimedia.org and wikimediafoundation.org sites were affected by the cert pro... [16:22:02] (03CR) 10Gilles: "The change applies fine on Beta, proxy-server keeps working as expected." [puppet] - 10https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: 10Gilles) [16:26:08] <_joe_> yuvipanda: what role class do you use on k8s workers? 
[16:26:21] https://bugs.chromium.org/p/chromium/issues/detail?id=645629 [16:27:21] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [16:27:29] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:55] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3129725 keys - replication_delay is 0 [16:27:57] <_joe_> mdholloway: seems completely unrelated [16:28:08] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 84.57 ms [16:28:17] (03PS2) 10MarcoAurelio: Rename 'autopatrol' to 'autopatrolled' on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) [16:28:21] [11:27] I have windows 10 I tried Latest Opera, Chrome IE 11 and MS Edge [16:28:21] [11:27] I tried to access - https://en.wikipedia.org but failed [16:28:21] joe: role::toollabs::k8s::worker [16:28:31] @ bblack [16:29:03] (03CR) 10MarcoAurelio: Rename 'autopatrol' to 'autopatrolled' on fawiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) (owner: 10MarcoAurelio) [16:29:12] _joe_: there's definitely stuff that should be in role::toollabs::k8s::worker that should be in k8s::worker and vice versa [16:29:33] <_joe_> yuvipanda: I'm mostly re-doing a lot of things from scratch [16:29:42] <_joe_> but I wanted to look at what you built :) [16:29:43] * yuvipanda nods [16:29:59] bblack Zppix it's affecting wmflabs now [16:30:26] joe: that seems like the appropriate thing to do [16:30:48] paladox: it potentially affects all globalsign customer certs, and they're a provider for several of ours [16:30:55] paladox ack [16:30:58] right now we're concentrating on the primary unified cert for the production domains [16:31:24] bblack do you want to make note of the WMFlabs error ? 
[16:32:44] I don't know that most users are going to understand which certs affect which things. it's fair to say it's possible for it to affect most things of ours [16:33:19] bblack I've gotten reports in [16:33:27] #wikipedia-en-help as well as noted above in irc [16:34:08] yeah I'm just saying, it's more than just labs [16:34:16] it's going to be most things, except a few technical one-off sites [16:34:19] bblack I know that :) [16:34:54] it seems to be staying with the trend of only affecting win 10 and sierra [16:35:44] (PS12) Rush: labstore: Add monitoring for secondary HA cluster health [puppet] - https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: Madhuvishy) [16:36:42] Operations, Analytics-Kanban, Traffic, HTTPS: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713447 (Zppix) A user in ENWIKI's help irc channel reports the error on Windows 10 Professional latest version, on Chrome - Version 54.0.2840.59 beta-m [16:36:50] (CR) Rush: [C: 1] "This is pretty great thanks" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: Madhuvishy) [16:37:59] Operations, Analytics-Kanban, Traffic, HTTPS: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713448 (Zppix) [16:38:08] bblack it appears opera 12 also works to access WMF sites (according to the user in -help) [16:39:54] I've just booted a windows 10 machines for the first time in weeks and the error appeared- it is not just a mere caching issue; there is caching at server side, too [16:40:24] <_joe_> jynus: there clearly is [16:40:27] jynus It's intermitten for me [16:40:27] thats great, [16:40:29] On ie [16:40:43] On edge fully on ie intermitten [16:59:20] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 85.49 ms [16:59:41] All these messages [16:59:47] What do they mean Mason?
[17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161013T1700). Please do the needful. [17:00:20] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 83.89 ms [17:00:22] DatGuy nothing to worry about [17:00:36] deploying maps [17:00:39] bblack: hello! i don't think it's down for everyone but i think the wikipedia sites might be down for my area again? (sorry, i think you helped me out the last time) [17:00:42] DatGuy: if it starts with RECOVERY it usually is a good thing :) [17:00:53] niedzielski we're aware see T148045 [17:00:54] T148045: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045 [17:01:33] mutante what if it starts with "Total failure, main page deleted, User:Jimbo Wales compromised"? [17:01:40] Zppix: unfortunately i can't access *.wikimedia.org [17:01:52] niedzielski what browser and operating system [17:01:57] (03CR) 10Elukey: [C: 032] Add dataCubes description to pivot config template [puppet] - 10https://gerrit.wikimedia.org/r/315695 (owner: 10Joal) [17:02:02] DatGuy then you run and never stop [17:02:04] Zppix: debian linux, chromium and firefox [17:02:20] niedzielski ah, fun... firefox was our workaround [17:02:28] Zppix: android [17:02:33] Zppix ie has now gone down [17:02:40] No more intermitten [17:02:47] paladox ack [17:03:06] bblack im getting two different certificates accross different browsers [17:03:06] Does this deserve the burininate token? 
[17:03:21] In firefox im getting [17:03:21] GlobalSign nv-sa [17:03:25] DatGuy Please try to keep social chatter to a min atm :) [17:03:29] In chrome im getting [17:03:30] GlobalSign Organization Validation CA - SHA256 - G2 [17:03:40] paladox: it's a known deep technical consequence of the underlying problem at GlobalSign [17:03:46] paladox ^ [17:03:46] Oh [17:03:55] I guess thats why firefox continues to work [17:04:09] niedzielski is it possible to try to use Opera [17:05:43] Zppix: sure i can download that. but this is happening across six android devices as well [17:06:51] niedzielski, ack bblack we may need more workarounds [17:06:52] (03PS1) 10Gehel: maps - make sure tilerator notification does not run concurrently [puppet] - 10https://gerrit.wikimedia.org/r/315703 [17:06:53] silly question due to our deployment host change (my bash history is gone) - what is the proper scap3 command? I remember it being -v something, and I cannot find anything on wikitech [17:07:00] (03PS13) 10Madhuvishy: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [17:07:02] thcipriani, ^? [17:07:12] <_joe_> niedzielski: when you say "down" what do you mean? [17:07:22] <_joe_> you get a security warning? [17:07:25] I have to step away for a moment [17:07:28] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 35.63 ms [17:08:06] yurik: to deploy a service? scap deploy -v (for verbose output) [17:08:10] _joe_: i'm trying to do a beta release for the android app at the moment and we have a bunch of tests that use a WebView (platform builtin chromium for apps) to communicate with wikimedia domains. they're all failing with a timeout [17:08:27] thcipriani, ah, yes, thanks. 
We should put it on https://wikitech.wikimedia.org/wiki/Scap3 i think [17:08:32] More detailed GlobalSign explanation of the problem https://twitter.com/globalsign/status/786612660397715456 [17:08:32] <_joe_> niedzielski: where are those running from? [17:08:37] yurik: yup, doing. [17:08:49] thx :) [17:09:24] _joe_: sorry, i'm not sure if i'm missing the question but they're running on physical devices in my home in colorado springs, colorado [17:09:29] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713587 (10Paladox) More detailed GlobalSign explanation of the problem https://twitter.com/globalsign/status/786612660397715456 [17:09:54] thcipriani, could you look why kartotherian canary failed? [17:09:58] <_joe_> niedzielski: ok, I am not sure the issue is the same we're experiencing all around which is ^^ [17:10:02] 06Operations, 10Datasets-General-or-Unknown: reinstall snapshot1001.eqiad.wmnet with RAID - https://phabricator.wikimedia.org/T140439#2713593 (10Cmjohnson) [17:10:04] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: decommission snapshot1002, 1003, 1004 - https://phabricator.wikimedia.org/T141762#2713591 (10Cmjohnson) 05Open>03Resolved [17:10:10] yurik: yup [17:10:17] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713594 (10Pietrodn) More detailed explanation of the technical problem by GlobalCert: https://downloads.globalsign.com/acton/fs/blocks/showLandingPage/a/2674/p/p-008f/t/p... [17:10:20] yurik: probably what tyler pointed out on ops list earlier this week? [17:11:04] According to them it will take 4 days for it to fix [17:11:24] bblack: do we have an non ssl ver of wmf sites to use temp im just trying to find another workaround incase firefox for some reason drops [17:11:28] _joe_ /cc bblack: right, i think this is a different issue. 
i think i saw it before with maybe a restbase domain? i'm sorry but i can't quite remember. it only happened to affect my area of colorado (or so they told me :] ) [17:11:30] (03PS1) 10BBlack: GlobalSign G2 intermediate, signed by R3 [puppet] - 10https://gerrit.wikimedia.org/r/315705 (https://phabricator.wikimedia.org/T148045) [17:12:18] yurik: ah, yeah, one thing: scap/scap.cfg has git_server: tin.eqiad.wmnet. You should remove that since you're deploying from mira. [17:12:30] (03CR) 10Faidon Liambotis: [C: 032] GlobalSign G2 intermediate, signed by R3 [puppet] - 10https://gerrit.wikimedia.org/r/315705 (https://phabricator.wikimedia.org/T148045) (owner: 10BBlack) [17:12:54] yurik: it failed because it can't find e7583f1069f518e4df122f18e487fb704fcec885, probably because it's looking at tin rather than mira. [17:13:05] (03CR) 10Faidon Liambotis: "@@ -2,12 +2,12 @@" [puppet] - 10https://gerrit.wikimedia.org/r/315705 (https://phabricator.wikimedia.org/T148045) (owner: 10BBlack) [17:13:31] thcipriani, should i change tin to deployment.eqiad.wmnet ? [17:13:32] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS, 13Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713607 (10Legoktm) [17:13:48] yurik: per _joe_ on the ops list, just update to mira.codfw.wmnet [17:14:13] the other option would be to just remove it from the ./scap/scap.cfg and it'll fall back to what is in /etc/scap.cfg [17:14:30] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1947460 (10Cmjohnson) Disconnected everything, added to decom tracking sheet. [17:14:34] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:14:48] thcipriani: does /etc/scap.cfg always have the correct value? 
[17:14:49] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2713617 (10Cmjohnson) uploaded reports to the HP online portal [17:15:12] Zppix joe : ok, i tried opera out and see the certificate revoke error [17:15:47] Damn ok [17:16:33] mobrovac: that is the value that is set by scap::deployment_server in puppet, IIRC. It's currently set to deployment.eqiad.wmnet, which is just human convenience, so while it will work, it's not "correct" [17:16:59] uf [17:17:29] !log disabling puppet on all cache nodes [17:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:17:38] thcipriani: ok, then yurik should change the value to mira.codfw.wmnet [17:17:39] morebots: hrm. We could probably update that value to look at '::deployment_server' which is what is used to determine the deployment master globablly. [17:17:39] I am a logbot running on tools-exec-1219. [17:17:39] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [17:17:39] To log a message, type !log . 
[17:18:03] thcipriani: completion fail ^ [17:18:04] :P [17:18:28] heh, whoops [17:18:38] (03PS1) 10Yuvipanda: Add legacy trusty container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315706 (https://phabricator.wikimedia.org/T148054) [17:19:00] mobrovac: yurik: yeah, agree, just change to mira.codfw.wmnet [17:19:14] thcipriani, sorry, already deployed with the missing value [17:19:18] i will chaneg it for the next depl [17:20:10] !log deployed kartotherian https://gerrit.wikimedia.org/r/#/c/315701/ [17:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:15] !log powering off mw1001-1148 to be decommissioned (except mw1017 and mw1099) per T141522 [17:21:16] T141522: Physically decommission mw1001-mw1148 (except mw1017 and mw1099) - https://phabricator.wikimedia.org/T141522 [17:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:50] (03PS2) 10Yuvipanda: Add legacy trusty container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315706 (https://phabricator.wikimedia.org/T148054) [17:24:09] (03CR) 10jenkins-bot: [V: 04-1] Add legacy trusty container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315706 (https://phabricator.wikimedia.org/T148054) (owner: 10Yuvipanda) [17:26:05] Zppix joe bblack: all sites seem to be working now. thanks! [17:26:33] (03PS3) 10Yuvipanda: Add legacy trusty container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315706 (https://phabricator.wikimedia.org/T148054) [17:27:22] All Wikimedia sites working for me too [17:28:34] <_joe_> did you do something or they just worked? [17:29:14] Still broken here. 
[17:29:17] _joe_: I did this http://apple.stackexchange.com/a/257112/33925 [17:29:25] <_joe_> sjoerddebruin: try that ^^ [17:29:54] <_joe_> pietrodn: we were toyng around with the sqlite3 file, but i didn't dare touch it [17:29:54] don't forget to restart the browser after that [17:30:20] That works. [17:30:24] Please communicate this! :) [17:30:59] _joe_: well it's a cache so I think it doesn't hurt to even delete it… although messing with the keychains sounds scary as hell :) [17:31:13] pietrodn still broken on ie [17:31:14] For most sites though, still have trouble with the Dutch arbcomwiki. [17:31:24] and edge [17:31:32] (03PS4) 10Yuvipanda: Add legacy trusty container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315706 (https://phabricator.wikimedia.org/T148054) [17:31:32] <_joe_> pietrodn: that file is used by trustd, if you mess that up it can be /bad/ [17:31:59] bd808: ^ take a look at that image when you have the chance? particularly, C.UTF-8 locale doesn't seem to exist in Ubuntu trusty, so I've to use en_US.UTF-8 [17:32:12] so perhaps I should use en_US.UTF-8 for all containers, including jessie [17:32:12] Ill look into that paladox [17:32:16] Ok [17:32:18] thanks [17:32:27] Ty niedzielski [17:32:38] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS, 13Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713690 (10Pietrodn) Working workaround for Chrome and Safari on macOS Sierra: http://apple.stackexchange.com/a/257112/33925 ``` $ sqlite3 ~/Library... [17:32:48] sjoerddebruin: remind me what os and broswer [17:33:09] Sierra, Safari 10.0. [17:33:13] But like I said, only works half. [17:33:16] Zppix|mobile: macOS Sierra, Chrome+Safari [17:33:30] Most wiki's work again, but the dutch arbcomwiki and phab doesn't. 
[17:33:32] !log pushing new intermediate to caches - T148045 [17:33:34] T148045: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045 [17:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:53] Ok [17:34:23] sjoerddebruin: phabricator works for me [17:34:30] Weird. [17:34:36] all sites working for me (sierra/chrome) after the sqlite3 cache delete [17:34:37] including phab [17:35:21] on safari as well [17:35:38] we have a supposed universal fix rolling out now, too [17:35:44] GlobalSign site doesn't work, ironically, even after the cache delete https://downloads.globalsign.com/acton/fs/blocks/showLandingPage/a/2674/p/p-008f/t/page/fm/0 [17:35:47] yuvipanda: in meetings for a bit, but en_US.UTF-8 would work I think everywhere [17:35:48] (may already be live, on some percentage of endpoints) [17:36:02] should be live on all now [17:37:06] bblack wikipedia works for me on ie now [17:37:09] but phab dosent [17:37:12] I just ran certutil -urlcache * delete too [17:37:18] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS, 13Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (10BBlack) We've received an updated intermediate cert from GlobalSign that's compatible with our existing end-certs and supposedly fixes the... [17:37:23] phab is cache_misc, should work [17:38:16] I can confirm several misc servers continue having the issue, wikis fixed [17:38:39] I think phab has its own separate cert [17:38:43] ? 
[17:38:44] yep per ^^ [17:38:56] or the user uploads do [17:39:03] not sure about the rest [17:39:17] no, phab does not have a separate cert [17:39:48] but it is not only phab, it is icinga, tendril [17:40:03] Also integration.wikimedia.org [17:40:06] still has the problem [17:40:08] icinga and tendril are not on the cache terminators and haven't been dealt with [17:40:17] ok, sorry [17:40:19] phab is on the cache terminators [17:41:00] otrs is also still inaccessible for me [17:43:10] is anyone still having a problem with https://phabricator.wikimedia.org/ on affected clients? [17:43:17] (and if so, have you tried relaunching the browser from scratch?) [17:43:36] bblack yes im still having problems, on IE [17:43:42] But en.wikipedia.org started working [17:43:57] paladox: can you close all browsers and then repro a failure for phabricator in IE? [17:44:03] Ok [17:44:09] bblack, I can confir after a browser restart [17:44:22] jynus: what browser? [17:44:26] but it is edge, who knows how intrincate it is to the os [17:44:40] Oh opening a second desktop in windows 10 [17:44:48] and opening ie and loading phab works [17:44:53] bblack jynus ^^ [17:45:00] (03PS2) 10Gehel: maps - make sure tilerator notification does not run concurrently [puppet] - 10https://gerrit.wikimedia.org/r/315703 [17:45:07] bblack: yep, on safari [17:45:47] triend on a "private session", too, just in case [17:45:48] now they work [17:46:05] ^same [17:46:10] bblack works now [17:46:13] !log forced ocsp stapling update on all caches, just in case [17:46:14] on phabricator [17:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:46:41] (03PS1) 10Yuvipanda: Switch jessie continer locale to en_US too [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315708 [17:46:55] (03CR) 10Gehel: [C: 032] maps - make sure tilerator notification does not run concurrently [puppet] - 10https://gerrit.wikimedia.org/r/315703 (owner: 10Gehel) [17:47:08] bblack but 
wikitech is now not working [17:47:19] bd808: made another patch to mover jessie to enUS too [17:48:11] wikitech is separate [17:48:14] oh [17:48:26] we've only fixed the primary traffic cache clusters that host the big production wikis, as well as some of our tools like phabricator [17:48:45] but not all the technical one-offs like wikitech, gerrit, tendril, icinga, etc, etc [17:48:57] Oh [17:49:04] bblack gerrit should not be affected [17:49:08] it uses letsencrypt [17:49:17] It dosent use any globalsign certs [17:49:26] oh true [17:49:39] I haven't yet enumerated which do or don't use globalsign, it's a todo shortly :) [17:49:55] Oh [17:50:21] bblack do you want me to write up a list of wmf domains whom use global sign? [17:50:39] it's ok, we can figure out it definitely by running commands over our config repo, etc... [17:50:44] ok [17:50:57] just waiting for things to settle down and be sure we're comfortable with the initial fix first [17:51:22] Let's hope this fix doesnt end up b eing our next nightmare :P [17:56:14] (03PS1) 10Legoktm: contint: Install php7.0-ast for phan [puppet] - 10https://gerrit.wikimedia.org/r/315711 (https://phabricator.wikimedia.org/T132636) [17:56:57] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review, 06Services (doing): Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2713751 (10mobrovac) [18:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161013T1800). [18:00:05] Pchelolo: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:17] * Pchelolo is here [18:00:33] watches Pchelolo run away :P [18:00:36] any new reports of breakage on the wiki projects? or phabricator? 
[18:00:42] negative bblack [18:00:44] I can SWAT today [18:00:51] thank you thcipriani [18:00:57] bblack nope works for me now [18:01:10] phabricator works [18:01:28] quick question does one have to work at WMF to be apart of the operations team? or can one volunteer on the operations team? [18:02:12] There are non-WMF employees with production access, if that's what you mean. [18:02:37] Is there a volunteer with merge in puppet? I don't believe so (but I could be wrong). [18:02:47] merge rights* [18:03:01] no, i meant can one be apart of the operation team w/o needing to work at WMF [18:03:17] there are a few volunteer roots [18:03:27] it's by definition a work team :) [18:03:31] how does one volunteer for operations? [18:03:36] but most were staff [18:03:46] submit patches like any other project :) [18:03:51] (03PS1) 10Cmjohnson: Fixing typo on kubernetes1004.mgmt in wment file [dns] - 10https://gerrit.wikimedia.org/r/315713 [18:03:58] oh so anyone has access to the operation repo? [18:04:05] (03CR) 10Cmjohnson: [C: 032] Fixing typo on kubernetes1004.mgmt in wment file [dns] - 10https://gerrit.wikimedia.org/r/315713 (owner: 10Cmjohnson) [18:04:07] Zppix yes [18:04:11] operations/* [18:04:16] I wish i knew that :P [18:04:20] Zppix: https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet [18:04:21] Its in gerrit [18:04:25] Per ^^ [18:04:46] You just need to get an ops or someone with c+2 to merge [18:04:48] and review [18:05:00] your patches for operations/* [18:05:21] kk [18:05:29] Pchelolo: you change is live on mw1099 [18:05:42] *your [18:05:45] cool thcipriani, although there's no way to test it [18:06:14] Pchelolo: ah, gotcha, ok, rolling everywhere. [18:08:13] bblack i guess the fix can be rolled out to the other wikimedia sites? 
[18:08:19] yeah, I'm working on that now [18:09:17] Ok thanks [18:09:40] !log deployed https://phabricator.wikimedia.org/D413 on iridium and restarted apache [18:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:10] !log thcipriani@mira Synchronized php-1.28.0-wmf.22/extensions/EventBus/EventBus.hooks.php: SWAT: [[gerrit:315696|Do not set the performer property if the user is not available. (T147977)]] (duration: 01m 38s) [18:10:11] T147977: [BUG] refresh links job failing with: Call to a member function getName() on a non-object (null) - https://phabricator.wikimedia.org/T147977 [18:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:18] ^ Pchelolo live everywhere [18:10:28] thank you thcipriani :) [18:10:59] * thcipriani wonders why sync-file took so long :\ [18:12:25] What are swat deployments for? [18:14:18] Zppix: first place to look for this kind of info is wikitech: https://wikitech.wikimedia.org/wiki/SWAT_deploys [18:15:38] So phabricator search is back to normal now... FYI [18:15:45] :) [18:16:19] 06Operations, 06WMF-Communications: Feasibility of hosting podcast setup on Wikimedia servers - https://phabricator.wikimedia.org/T148061#2713810 (10Varnent) [18:21:53] (03PS1) 10Giuseppe Lavagetto: kubernetes: introduce 1st-stage worker role [puppet] - 10https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) [18:21:59] <_joe_> twentyafterfour: meaning it won't find what I want? 
[18:22:00] <_joe_> :P [18:22:17] <_joe_> yuvipanda: I'd appreciate if you take a look at ^^ [18:22:24] _joe_ it should be better at finding things now [18:22:45] <_joe_> paladox: yeah I was joking about the fact I never loved the phab search in general [18:22:59] Oh [18:23:00] (CR) Giuseppe Lavagetto: [C: -1] docker::engine: remove execs, transform to pure-puppet [puppet] - https://gerrit.wikimedia.org/r/315294 (owner: Giuseppe Lavagetto) [18:23:01] LOL [18:23:02] twentyafterfour, you deployed to production already? [18:23:12] jynus yes [18:23:18] jynus: sure enough. [18:23:22] great [18:23:23] (Abandoned) Giuseppe Lavagetto: docker::engine: remove execs, transform to pure-puppet [puppet] - https://gerrit.wikimedia.org/r/315294 (owner: Giuseppe Lavagetto) [18:23:25] _joe_ we were testing with elasticsearch [18:23:27] Seems to work well [18:23:47] <_joe_> paladox: and how did that work? [18:23:56] _joe_ pretty good [18:23:59] found everything [18:24:01] I would send an email to wikitech explaining the issues [18:24:06] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Stirring The Pot, and 2 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2713885 (AndyRussG) Following up on the patch ^ by @Ejegg, addressing the issue he found! Looking at th... [18:24:12] and then take next steps slowly [18:24:24] We also fixed some issues with elasticsearch 2 so should work with that [18:24:32] I would still like to do a failover of the master [18:24:42] for other reasons [18:24:49] _joe_: elasticseach works better, I think... We (#releng) discussed it today and we may bring back elasticsearch next quarter...
[18:24:57] Oh :) [18:25:06] jynus: ok [18:25:15] let me prepare the patch [18:25:18] Should be really easy to turn on phab, just a config change, then an index [18:25:43] But then, there would be other problems, like where will we host elasticsearch [18:25:45] and other things [18:26:07] jynus: We can do the failover whenever you are ready, I'm here to handle phab config change [18:26:16] no config change needed [18:26:23] oh yeah you have the proxy [18:26:25] just a service restart to reload the database connections [18:26:29] yeah :-) [18:26:31] less trouble [18:26:46] even less if it didn't have persistent connections [18:27:07] it is the joy of dynamic state [18:27:12] Anyone still having Certificate problems? [18:27:43] Nope, but i am on the domains that still need the new cert. [18:28:09] (PS4) Dzahn: contint: stop jenkins on contint1001 [puppet] - https://gerrit.wikimedia.org/r/315650 (owner: Hashar) [18:28:11] Zppix: I was having trouble reaching phabricator until recently but it seems fine now [18:32:17] !log restbase deploy start of d510090 [18:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:32:33] twentyafterfour ack [18:32:48] Operations, Analytics-Kanban, Traffic, HTTPS, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713909 (BBlack) We're working through the other minor one-off cert issues now on smaller (mostly for technical folks sites), I'm breaking off a se... [18:32:52] bblack what's the verdict is SSL fixed now or are we ok?
[18:33:30] (03PS1) 10Jcrespo: dbproxy: Failover phabricator dbs from db1048 to db1043 [puppet] - 10https://gerrit.wikimedia.org/r/315718 (https://phabricator.wikimedia.org/T146673) [18:34:08] (03CR) 10Dzahn: [C: 032] contint: stop jenkins on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/315650 (owner: 10Hashar) [18:34:10] Zppix: I think we're fixed, pending som eupdates to minor technical sites [18:34:21] and tool labs, etc [18:34:22] ack bblack [18:34:41] (03PS2) 10Jcrespo: dbproxy: Failover phabricator dbs from db1048 to db1043 [puppet] - 10https://gerrit.wikimedia.org/r/315718 (https://phabricator.wikimedia.org/T146673) [18:34:44] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:35:16] bblack wikitech and integration now work for me [18:35:22] Zppix ^^ [18:35:26] twentyafterfour, I am going to deploy https://gerrit.wikimedia.org/r/315718 which doesn't do anything until a restart is sent [18:35:40] jynus: ok [18:35:49] so you want me to restart apache on iridium? 
[18:36:07] then I will restart-> we will go to read only -> restart phabricator -> read only -> read write [18:36:32] (03CR) 10Zppix: [C: 031] dbproxy: Failover phabricator dbs from db1048 to db1043 [puppet] - 10https://gerrit.wikimedia.org/r/315718 (https://phabricator.wikimedia.org/T146673) (owner: 10Jcrespo) [18:36:34] be ready to restart it, yes, when I say it [18:36:51] but I have to check the replication topology first [18:36:53] jynus: ok I'm logged in and ready [18:37:00] stand by for now [18:37:10] (03CR) 10Dzahn: "on contint1001: Notice: /Stage[main]/Jenkins/Service[jenkins]/enable: enable changed 'true' to 'false'" [puppet] - 10https://gerrit.wikimedia.org/r/315650 (owner: 10Hashar) [18:37:12] I will long when I start [18:37:14] *log [18:39:17] !log contint1001 - stop jenkins service [18:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:37] (03CR) 10Giuseppe Lavagetto: Conftool: Create script that checks the state after (de)pooling (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: 10Mobrovac) [18:39:54] (03PS8) 10Giuseppe Lavagetto: Conftool: Create script that checks the state after (de)pooling [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: 10Mobrovac) [18:40:18] (03CR) 10Dzahn: "manually stopped jenkins on contint1001 - puppet does not reactivate it , as desired" [puppet] - 10https://gerrit.wikimedia.org/r/315650 (owner: 10Hashar) [18:40:45] (03CR) 10jenkins-bot: [V: 04-1] Conftool: Create script that checks the state after (de)pooling [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: 10Mobrovac) [18:40:53] <_joe_> mutante: you should do systemctl mask or something like that [18:41:17] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: GlobalSign intermediate updates for one-offs - 
https://phabricator.wikimedia.org/T148069#2713969 (10BBlack) [18:41:48] paladox Ack [18:41:56] Yep [18:43:58] !log setting up circular replication db1043 <-> db1048 [18:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:04] _joe_: ok, done. it's puppetized as ensure => 'unmanaged' [18:44:08] ^we are not there yet [18:44:17] !log contint1001 - systemctl mask jenkins.service [18:44:21] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:28] hashar wanted it that way for now [18:44:40] o/ [18:44:50] hashar: see the "mask" part [18:44:53] mutante: ah mask !!!! :]] [18:45:15] I wanted to move the Jenkins data from gallium to contint1001 [18:45:24] when I realized that puppet would spawn jenkins magically [18:45:35] I got a lame puppet patch which is untested but might do it [18:45:45] hashar: yep, merging that change did not actually stop the service, but it "disabled" it [18:45:56] then i did a manual stop.. [18:46:03] then the mask in addition after joe's comment [18:46:08] ah yeah enable => false , prevent it from starting on boot [18:46:11] (03CR) 10Jcrespo: [C: 032] dbproxy: Failover phabricator dbs from db1048 to db1043 [puppet] - 10https://gerrit.wikimedia.org/r/315718 (https://phabricator.wikimedia.org/T146673) (owner: 10Jcrespo) [18:46:12] yes [18:46:30] twentyafterfour, about to start [18:46:31] PROBLEM - jenkins_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [18:46:33] and I was expecting ensure => undef to prevent puppet from spawining it [18:46:54] hashar: it works, it just doesnt stop it. 
but after manually stopping it puppet will not restart it [18:46:55] looks like mask is the best solution [18:47:02] jynus: ok tell me when [18:47:09] hashar: well, now we have both [18:47:31] mutante: also we might want to ACK icinga checks for contint1001 until Oct 25th [18:47:36] schedules downtime for it [18:47:44] I am not sure whether they page ops [18:47:45] i already have that tab open [18:48:10] I tried to ack one yesterday, but apparently I lack write access on icinga [18:48:35] if you are doing something on phabricator, wait 1 minute before pushing save now [18:48:38] ACKNOWLEDGEMENT - jenkins_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn gallium is still prod for now [18:48:38] ACKNOWLEDGEMENT - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused daniel_zahn gallium is still prod for now [18:48:55] we will have one minute of read-only mode [18:49:00] hashar: we can fix the icinga access for you, but i got this one [18:49:11] scheduling downtime until Oct 25 [18:49:52] !log restbase deploy end of d510090 [18:49:55] !log setting phabricator db in read only mode for master failover [18:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:24] twentyafterfour, now should be the time to reload phab [18:50:35] jynus: reloading [18:50:43] confirm me when done [18:50:55] jynus: done [18:51:00] just when I wanted to post something :P [18:51:15] This error is showing [18:51:16] #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement [18:51:19] #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement [18:51:22] heh [18:51:22] On phabricator [18:51:25] LOL [18:51:29] at the same 
time [18:51:34] that's intended [18:51:38] to put and phab is back [18:51:45] Oh [18:52:12] don't say we didn't warn this 20 second of downtime! [18:52:18] ;) [18:52:28] jynus: ok to restart the phabricator job queue now? [18:52:37] yes [18:52:48] (I just stopped it for the switchover) [18:52:49] but is that hardcoded [18:52:49] ok thanks [18:52:50] ? [18:52:52] wait [18:52:55] ? [18:52:59] it should change the slave dns [18:53:09] does it connect to m3-slave? [18:53:22] (03PS9) 10Giuseppe Lavagetto: Conftool: Create script that checks the state after (de)pooling [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: 10Mobrovac) [18:53:22] it should connect to the same db as phabricator web [18:53:23] if it does, we should do a dns failover first [18:53:31] I can double-check [18:53:43] twentyafterfour, if it is m3-master, yes [18:54:00] let me do the dns change anyway [18:54:05] m3-master [18:54:17] then yes [18:54:19] ok cool [18:54:42] (03PS1) 10Zppix: Adds translations to the user's lang in the links within the readme in the ROOT dir. [puppet] - 10https://gerrit.wikimedia.org/r/315728 [18:54:54] there is 3 users sstill connected to the old host [18:55:01] :-/ [18:55:04] hmm... [18:55:07] I will kill them [18:55:19] ok should be fine, maybe straggling connections from phd? [18:55:20] jynus now now violence isnt the answer :P [18:55:37] (03PS1) 10Ori.livneh: Enable AbuseFilterCachingParser on testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315729 [18:55:43] AaronSchulz: ^ [18:56:06] (03PS2) 10Zppix: Adds translations to the user's lang in the links within the readme in the ROOT dir. 
[puppet] - 10https://gerrit.wikimedia.org/r/315728 [18:56:22] twentyafterfour, they didn't reconnect, so things are cool [18:56:27] jynus: cool [18:56:48] I need a more advanced proxy to mitigate issues like those [18:57:40] (03CR) 10Aaron Schulz: [C: 031] Enable AbuseFilterCachingParser on testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315729 (owner: 10Ori.livneh) [18:57:41] ok, now dns [18:57:59] (03CR) 10Zppix: [C: 031] Conftool: Create script that checks the state after (de)pooling [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: 10Mobrovac) [18:58:02] and tendril [18:59:01] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714049 (10BBlack) [18:59:25] (03CR) 10Madhuvishy: [C: 032] labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [18:59:33] (03PS14) 10Madhuvishy: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [18:59:37] (03CR) 10Madhuvishy: [V: 032] labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [18:59:42] (03PS2) 10Jcrespo: wmnet: remove labsdb1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/313894 (https://phabricator.wikimedia.org/T146455) [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161013T1900). 
[19:00:26] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:00:30] (03CR) 10Mobrovac: [C: 04-1] "One super minor doc nit, otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: 10Mobrovac) [19:00:49] * thcipriani digs through blockers [19:01:09] (03CR) 10Jcrespo: [C: 032] wmnet: remove labsdb1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/313894 (https://phabricator.wikimedia.org/T146455) (owner: 10Jcrespo) [19:02:17] thcipriani i thought there was a day apart from group 1 and group 2 deploy [19:02:19] (03PS1) 10Jcrespo: wmnet: Set db1048 as the new s3-slave after failover [dns] - 10https://gerrit.wikimedia.org/r/315732 (https://phabricator.wikimedia.org/T146673) [19:03:22] (03PS1) 10Madhuvishy: nfs: Fix drbd monitoring param name typo [puppet] - 10https://gerrit.wikimedia.org/r/315733 [19:03:23] Zppix: hrm? Typical cadence is Tuesday new branch group 0, Wednesday group1, Thursday group2 (all wikis) [19:03:25] (03CR) 10Jcrespo: [C: 032] wmnet: Set db1048 as the new s3-slave after failover [dns] - 10https://gerrit.wikimedia.org/r/315732 (https://phabricator.wikimedia.org/T146673) (owner: 10Jcrespo) [19:03:59] thcipriani eh, i was probably dreaming that then lol i cant keep track anymore [19:04:10] !log updating dns for labsdb1002 and m3-slave [19:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:06] (03PS1) 10Yuvipanda: tools: Update clush classifier prefix for static nodes [puppet] - 10https://gerrit.wikimedia.org/r/315735 [19:05:08] (03PS1) 10Yuvipanda: tools: Grant clush user complete sudo rights for everything [puppet] - 10https://gerrit.wikimedia.org/r/315736 [19:05:19] everything looks fine, I will upgrade db1048 tls and packages later now that it is passive [19:05:26] Zppix: :D for reference I usually just follow along with this 
https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys pretty much verbatim [19:05:52] (03CR) 10Madhuvishy: [C: 032] nfs: Fix drbd monitoring param name typo [puppet] - 10https://gerrit.wikimedia.org/r/315733 (owner: 10Madhuvishy) [19:06:01] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714092 (10BBlack) [19:06:58] "Unhandled Exception ("AphrontQueryException")" [19:07:10] if I search "--" [19:07:32] or - [19:07:35] jynus thats no bueno, hmm maybe too many results? [19:07:59] I am trying to do an sql injection [19:08:07] oh [19:08:11] hmm [19:08:43] 06Operations, 10Citoid, 10Graphoid, 10VisualEditor, and 3 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#2714114 (10mobrovac) 05Resolved>03Open Nope nope, we still need to do this one. Sorry for dropping the ball on this one! [19:09:52] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:10:29] (03CR) 10Hashar: "I guess the commit message is not straightforward enough for the old folks :] All good so!" [puppet] - 10https://gerrit.wikimedia.org/r/315571 (owner: 10Chad) [19:11:01] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2713969 (10MoritzMuehlenhoff) seaborgium and serpens use certs from our internal CA, not from GlobalSign. [19:11:29] thcipriani, deploying? [19:11:54] i would like to push tilerator out - a non-publicfacing service [19:11:57] via scap3 [19:12:08] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:08] PROBLEM - Host sca1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:11] yurik: no, trying to figure out if https://phabricator.wikimedia.org/T147986 is still a thing, actually. Should be fine. 
[19:12:39] PROBLEM - Host sca2004 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:40] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714139 (10BBlack) The ones in the puppet repo under files/ssl/ are signed by GlobalSign.... I wonder what's out of sync here? [19:13:17] why are sca hosts dying above? [19:13:18] PROBLEM - Check size of conntrack table on sca2003 is CRITICAL: Connection refused by host [19:13:22] (03CR) 10Hashar: "Package is named 'php-ast' and there is no "php7.0-ast" one :/" [puppet] - 10https://gerrit.wikimedia.org/r/315711 (https://phabricator.wikimedia.org/T132636) (owner: 10Legoktm) [19:13:45] anyone? [19:14:09] PROBLEM - DPKG on sca2003 is CRITICAL: Connection refused by host [19:14:14] related to the deploy? [19:14:16] sca1003.eqiad.wmnet is indeed dark to me [19:14:26] or nothing to do? [19:14:37] PROBLEM - salt-minion processes on sca2003 is CRITICAL: Connection refused by host [19:14:43] I haven't deployed anything yet, FWIW [19:14:48] ok [19:15:06] PROBLEM - Disk space on sca2003 is CRITICAL: Connection refused by host [19:15:07] PROBLEM - zotero on sca2003 is CRITICAL: Connection refused [19:15:10] that seems like more than coincidence 1003,1004,2003,2004 in a short time [19:15:22] checking sca1004 console [19:15:26] thcipriani, oki, i will go ahead and sync it [19:15:27] who do we ping for sca hosts these days? [19:15:39] PROBLEM - configured eth on sca2003 is CRITICAL: Connection refused by host [19:15:39] yurik: have you pushed tilerator to sc hosts? [19:15:53] oh they're ganeti, no console [19:15:57] PROBLEM - dhclient process on sca2003 is CRITICAL: Connection refused by host [19:15:59] yurik: see above sc are dieing [19:15:59] hashar, sc? 
no, only to maps [19:16:00] well no physical console [19:16:08] yurik: ah different cluster, my bad sorry :] [19:16:17] PROBLEM - puppet last run on sca2003 is CRITICAL: Connection refused by host [19:16:27] only sca2003 seems down now [19:16:44] Network flap? [19:16:46] that should dissappear too [19:16:47] no [19:16:56] these boxes are not yet up [19:16:59] are they all ganeti? maybe a ganeti issue? [19:16:59] logstash shows a huge spike of events from 5k to 50k [19:17:07] I think they're all ganeti, yes [19:17:12] and damn icinga should not have even alerted [19:17:24] ignore all sca boxes please [19:17:29] akosiaris: ok :) [19:17:36] most mediawiki.DBReplication info/warning, so I guess some network flapped ? [19:17:43] akosiaris: related: I need to run puppet on icinga? [19:17:48] well, ideally anyways [19:17:53] hashar: Nah, if there's MW errors seems unrelated. [19:17:55] done already [19:17:57] Alex said ignore sca failure :) [19:17:59] bblack: ^ [19:18:17] Error: CERT_UNTRUSTED :D [19:18:40] 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2714150 (10MoritzMuehlenhoff) Can you remove the old certs, so that they don't cause confusion again? [19:19:00] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714151 (10MoritzMuehlenhoff) When we setup the openldap replacement servers for the OpenDJ setup, we started with an internal cert from the beginning. From what I... 
[19:19:18] PROBLEM - Check status of DRBD node on labstore1005 is CRITICAL: NRPE: Unable to read output [19:19:21] hashar: icinga looks fixed [19:19:23] (to me) [19:19:35] maybe it lost network connection somehow [19:19:49] labstore is on me [19:19:51] silencing [19:20:16] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714152 (10BBlack) [19:21:02] hrm, so this is the error that is listed in https://phabricator.wikimedia.org/T147986 blocking the train: https://logstash.wikimedia.org/goto/4d287374c4ff7795d66f8df27f492815 [19:21:34] I'm not clear that there is any correlation between this error and wmf.22 [19:22:48] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:27] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:25:52] can a non-busy operation team member merge and/or cr +2 this https://gerrit.wikimedia.org/r/#/c/315728/ all it is a minor doc change [19:25:56] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714157 (10BBlack) These are all fixed up now I believe, except for the 3x externally-hosted sites, which still link to the R1 root.... [19:27:33] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS, 13Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (10hashar) OCG on ocg1001 ocg1002 ocg1003, started yielding CERT_UNTRUSTED error at 17:30 UTC One can monitor it via Grafana backend success... [19:30:24] Zppix: this has no importance, might be done later. 
There are currently bunch of operations going on [19:30:37] ack [19:30:46] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714176 (10BBlack) [19:31:40] !log deployed and restarted tilerator[ui] https://gerrit.wikimedia.org/r/#/c/315707/ [19:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:27] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: connection error: HTTPConnectionPool(host=localhost, port=8000): Max retries exceeded with url: /?command=health (Caused by class socket.error: [Errno 111] Connection refused) [19:32:44] thcipriani: no idea what is the lock manager :/ [19:32:53] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714191 (10BBlack) @akosiaris found https://github.com/nodejs/node/blob/db1087c9757c31a82c50a1eba368d8cba95b57d0/src/node_root_certs.h [19:35:14] thcipriani: ah local-multiwrite has for lockManager a redis one which is rdb1 rdb2 rdb3 ? maybe they are in trouble [19:35:37] 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2714195 (10RobH) Just to confirm, the certs currently in the repo are no longer needed? (So no certs in files/ssl for this?) 
[19:36:14] then our redis config has soo many layers that I never managed to see the end and find out what is the actual instance having an issue :( [19:37:34] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455#2714199 (10RobH) [19:42:35] (03PS1) 10Madhuvishy: nfs: Add sudo permissions for nagios user to run drbd commands [puppet] - 10https://gerrit.wikimedia.org/r/315742 (https://phabricator.wikimedia.org/T144633) [19:43:48] (03CR) 10jenkins-bot: [V: 04-1] nfs: Add sudo permissions for nagios user to run drbd commands [puppet] - 10https://gerrit.wikimedia.org/r/315742 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [19:46:22] (03PS1) 10Zppix: Added a new commonly typed typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315743 [19:46:43] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2714212 (10jcrespo) [19:46:45] 06Operations, 10DBA: db1034 decommission - https://phabricator.wikimedia.org/T139280#2714209 (10jcrespo) 05Open>03Resolved a:03jcrespo Will create a separate task when we stop using it, it is currently on full production. [19:47:07] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 721035 msg: ocg_render_job_queue 0 msg [19:47:08] Whenever someone from ops has a moment, something to ponder... [19:47:33] does it absouletly have to be someone from ops Revent cause im not busy if it doesnt [19:47:45] Zppix: Possibly not... [19:47:56] Just, looking at logs might be useful. [19:48:01] ok [19:48:06] whats the issue [19:48:11] https://commons.wikimedia.org/wiki/File:Map_of_Hindoostan,_1788,_by_Rennell.jpg <- this is a 182.99 MP image… [19:48:34] It seems ‘evident’ why it’s not thumbnailed, it’s over the 100MP limit defined in the config. 
[19:48:40] holy sh!t [19:48:50] ok, let me open a task [19:48:52] The question is more ‘why’ the older version ‘was’ successfully thumbnailed. [19:48:52] or something [19:48:53] Zppix: No [19:48:57] There's already one [19:48:59] oh [19:48:59] Zppix theres a task [19:49:00] link? [19:49:01] already [19:49:01] Revent: Exactly [19:49:46] 06Operations, 10DBA: Decommission db1035 - https://phabricator.wikimedia.org/T148078#2714228 (10jcrespo) [19:50:24] link to the task please [19:50:56] Reedy: I ‘think’ the original file had EXIF thumbnails in it. [19:51:28] https://phabricator.wikimedia.org/T147992 [19:51:38] Reedy ^^ Zppix [19:51:41] i think that is it [19:52:02] Probaly want to increase $wgMaxImageArea [19:52:05] 06Operations, 10DBA: Decommission db1015, db1035 and db1044 - https://phabricator.wikimedia.org/T148078#2714243 (10jcrespo) [19:52:25] paladox: Not really [19:52:32] We've numerous images that don't thumb due to being large [19:52:32] Oh [19:52:56] is anyone deploying anything? mind if I push out a small config patch? [19:53:01] (03PS2) 10Madhuvishy: nfs: Add sudo permissions for nagios user to run drbd commands [puppet] - 10https://gerrit.wikimedia.org/r/315742 (https://phabricator.wikimedia.org/T144633) [19:53:08] Reedy is there a setting blocking the size? 
[19:53:23] paladox: $wgMaxImageArea [19:53:33] paladox: The limit is there because large thumbs take more server resources [19:53:37] Oh [19:53:42] So it's to try and make sure we get a nice error message [19:53:49] Yep [19:53:59] (03CR) 10jenkins-bot: [V: 04-1] nfs: Add sudo permissions for nagios user to run drbd commands [puppet] - 10https://gerrit.wikimedia.org/r/315742 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [19:54:32] (03CR) 10Ori.livneh: [C: 032] Enable AbuseFilterCachingParser on testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315729 (owner: 10Ori.livneh) [19:54:43] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS, 13Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2714251 (10Nuria) Will get numbers for Mac OS requests on Chrome and Safari per hour for the last 3 days to quantify impact, let me know if you no lo... [19:54:58] (03Merged) 10jenkins-bot: Enable AbuseFilterCachingParser on testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315729 (owner: 10Ori.livneh) [19:55:03] I’m also (honestly) a bit curious as to exactly what yann did to make the file get over 4x ‘bigger’ [19:55:17] Download the old version and have a look? :) [19:55:19] thcipriani: is something still holding wmf.22 to enwiki? [19:55:51] hashar: no, gotta remove that blocker, 1 second. [19:56:18] I’m guessing it was something with turning down jpeg compression, tbh, but I’m not ‘that’ curious. 
:P [19:56:40] thcipriani: pretty sure the filebackend/lock/redis is an ongoing issue on some of the redis instances [19:56:49] or one ends up being saturated/refusing to connect [19:57:01] maybe we could compress that file in the files commons [19:57:12] paladox: Just for ‘comparison’… https://commons.wikimedia.org/wiki/File:Vincent_van_Gogh_-_Irises_(1889).jpg <- thumbnails just fine [19:57:18] For gods sake [19:57:26] jpegs are already compressed [19:57:28] hashar: it definitely is. [19:57:30] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: I8f6eb9f6af: Enable AbuseFilterCachingParser on testwiki and mediawikiwiki (duration: 00m 51s) [19:57:31] Ok thanks, LOL [19:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:44] i mean recompress Reedy [19:57:53] Zppix: And that helps, how? [19:57:56] PROBLEM - NTP on sca2003 is CRITICAL: NTP CRITICAL: No response from NTP server [19:57:57] Reedy what about develeping something server side that compresses them, but when downloading, you download the actual thing? [19:58:04] WHY ARE WE COMPRESSING THEM? [19:58:11] Reedy maybe it just compressed horribly [19:58:12] IT'S NOTHING TO DO WITH THE FILE SIZE IN MB [19:58:12] Since they get so large [19:58:19] Jesus [19:58:41] If you're not going to understand the problem, please don't make useless comments [19:58:48] (03PS3) 10Madhuvishy: nfs: Add sudo permissions for nagios user to run drbd commands [puppet] - 10https://gerrit.wikimedia.org/r/315742 (https://phabricator.wikimedia.org/T144633) [19:58:52] Zppix: We do not ‘ever’, if possible, want to change the compression setting in a existing jpeg, it introduces a generation loss and compression artifacts (jpeg is lossy) [19:59:22] And compressing the ‘file’ is pointless, as jpeg files (already compressed) compress quite poorly. 
[20:00:28] But the point is the problem has absolutely nothing to do with the file size [20:00:36] (nods) [20:00:40] You could have a multi GB image that's 100x100px [20:00:44] Reedy ( Is this supposed to work https://upload.wikimedia.org/wikipedia/commons/4/43/Map_of_Hindoostan%2C_1788%2C_by_Rennell.jpg ) [20:00:45] ? [20:00:53] You can have a 1MB file that's 1000000x10000000 [20:00:54] Just wondering [20:01:04] That's not a thumbnail [20:01:08] Oh [20:01:09] It's just sending you the original [20:01:15] Oh, now i get it [20:01:17] And your browser is just resizing it [20:02:07] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:02:45] (03CR) 10Madhuvishy: [C: 032] nfs: Add sudo permissions for nagios user to run drbd commands [puppet] - 10https://gerrit.wikimedia.org/r/315742 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [20:03:49] The server does not mind the ‘file size’ (in MB) particularly, it’s the number of pixels, and the configured limit that prohibits trying to run ‘gigapixel’ images through ImageMagick because of the (insane) amount of ram it would eat… the server actually was not ‘supposed’ to thumbnail yannf’s image. [20:07:27] 06Operations, 06Discovery, 06Discovery-Analysis (Current work), 07Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2700737 (10debt) Hey @Gehel - can you do a bit of pair programming with @mpopov on this? We're thinking that he can... 
[20:09:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:10:34] ^ sigh, that was me submitting patches to move all wikis to wmf.22 [20:10:56] thcipriani must you break everything :D [20:10:59] grrrit-wm leaving that is, I don't know about the 500s [20:11:11] and yes :) [20:11:25] thcipriani too bad thats my job [20:11:36] I think grrrit-wm restarts every couple hours on the first day after doing a restart [20:11:42] But should improve over time [20:11:59] I did some changes, and found the first day grrrit-wm will continue to restart then stabalise [20:12:07] !log thcipriani@mira rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.22 [20:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:13:36] paladox: I've notice it restart every time I use deploy-promote since that creates a patch, sends to gerrit, and +2s the patch at the same time. [20:13:50] Oh [20:14:22] Your patch looks the same as everyone so there must be something underneeth that does it [20:14:23] seeing tons of Warning: JsonConfig: Invalid $wgJsonConfigs['JsonZeroConfig']['remote']['url']: API URL is not set, and this config is not being stored locally [Called from JsonConfig\JCSingleton::parseConfiguration in /srv/mediawiki/php-1.28.0-wmf.22/extensions/JsonConf [20:14:49] I think that ^^ is related to it being converted to extension.json [20:15:28] thcipriani: some weeks issue [20:15:42] Oh never mind [20:15:47] Was converted in august [20:16:02] I'm rolling back. 
[20:16:17] :/ [20:17:01] !log thcipriani@mira rebuilt wikiversions.php and synchronized wikiversions files: Rollback 1.28.0-wmf.22 from group2 [20:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:19] https://phabricator.wikimedia.org/diffusion/EJSC/browse/master/includes/JCSingleton.php;765517c583b8209c4518126afed3325fb6371bec$145 [20:17:51] (03PS1) 10Thcipriani: Revert "all wikis to 1.28.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315749 [20:18:22] (03CR) 10Thcipriani: [C: 032] Revert "all wikis to 1.28.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315749 (owner: 10Thcipriani) [20:18:45] damn [20:18:51] (03Merged) 10jenkins-bot: Revert "all wikis to 1.28.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315749 (owner: 10Thcipriani) [20:19:36] 06Operations, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2714378 (10hashar) Alex can you do the magic SELECT again and see whether DNS entries are still being leaked? [20:19:53] yurik, ^ [20:20:03] thcipriani I think it's https://gerrit.wikimedia.org/r/#/c/303378/ [20:20:09] https://github.com/search?q=+%24wgJsonConfigs%5B%27JsonZeroConfig%27%5D%5B%27remote%27%5D%5B%27url%27%5D&type=Code&utf8=%E2%9C%93 [20:20:14] Reedy ^^ [20:20:16] Krenair, ? [20:20:59] jsonconfig/zeroportal configuration issue [20:21:38] hashar i think so [20:22:08] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2714379 (10Gilles) Note to self: look into setrlimit again, specifically the RSS limit, which would be a simpler solution than cgroups. 
[20:22:39] yurik, train rollback due to jsonconfig issue [20:22:57] Krenair, bleh [20:23:17] Krenair, its actually not jsonconfig, its the zerobanner not being customized correctly :( [20:23:24] There have been no recent updates to jsonconfig [20:23:31] https://phabricator.wikimedia.org/T147971 [20:24:21] yurik should we remove https://github.com/wikimedia/operations-mediawiki-config/blob/bbb692508d18bcc55441a21a62653c7110437fdd/wmf-config/mobile-labs.php#L11 ? [20:24:27] 06Operations, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2714390 (10AlexMonk-WMF) >>! In T115194#2714378, @hashar wrote: > Alex can you do the magic SELECT again and see whether DNS entries are sti... [20:24:31] and https://github.com/wikimedia/operations-mediawiki-config/blob/3210ea488229bd7071b04448ed1a67cd8402bc9d/wmf-config/mobile.php#L38 [20:24:32] ? [20:24:46] What does removing it fix? [20:24:49] paladox, that's a labs one, doesn't affect anything [20:24:53] Oh [20:24:53] ok [20:25:05] I think, as I've said numerous times, I think a merge strategy needs setting [20:25:36] Reedy, i'm not sure i understand what it means [20:25:40] The problem is the overly nested configuration [20:25:53] There's a reason other extensions don't have such complex config arrays [20:25:54] Reedy, the problem is that we have two different config systems at the same time [20:26:02] What? [20:26:03] No [20:26:17] Reedy maybe split up the configs into different sub dirs or files or something [20:26:25] !log Ran manual DB updates for T148057. 
[20:26:26] T148057: Fix user talk pages already in inconsistent state due to to T138310 - https://phabricator.wikimedia.org/T148057 [20:26:30] Zppix: That makes no sense [20:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:52] https://github.com/wikimedia/mediawiki/blob/master/docs/extension.schema.v1.json#L655-L664 [20:26:53] we used to have a simple array of arrays, and each extension would add a value wherever it needed it to that giant config tree. [20:26:57] yurik Yeh it needs the merge strategy per Reedy [20:27:02] Reedy what i meant was instead of the configs being in one spot and "overly nested" we move it into a simple array of sorts [20:27:03] I think we need array_merge_recursive [20:27:17] We should try ^^ [20:27:28] It may be needed in multiple places [20:27:53] I suspect, it's compounded by the same variable being used by multiple extensions [20:27:58] now with the new extension.json, the config merge magic happens internally, and I really have no idea how it happens [20:27:58] And nondeterministic loading order [20:28:35] https://github.com/wikimedia/mediawiki/blob/master/includes/registration/ExtensionRegistry.php [20:28:39] Reedy i don't think that will work [20:28:51] Since the config is outside of "config" [20:29:06] But no harm in trying [20:29:10] https://phabricator.wikimedia.org/T147971#2711488 [20:29:14] curious what's broken?
[20:29:18] https://phabricator.wikimedia.org/T147971#2711488 [20:29:26] audephone jsonconfig [20:29:27] zero [20:29:31] not jsonconfig [20:29:36] please be specific [20:29:38] Oh :( [20:29:58] zerobanner configuration gets set in two places - one in extension.json, and another part in commons.js [20:30:01] commons.php [20:30:08] Same as every other extension [20:30:32] wait, common.js isn't even executed at this point [20:30:39] yurik i guess that may be because all we do to check is [20:30:40] if ( $wmgZeroBanner && !$wmgZeroPortal ) { [20:30:49] Maybe a third config for labs only? [20:30:58] ^ [20:31:02] How does that help? [20:31:03] https://github.com/wikimedia/operations-mediawiki-config/blob/3210ea488229bd7071b04448ed1a67cd8402bc9d/wmf-config/mobile.php#L33 [20:31:06] paladox, labs php doesn't get loaded at all [20:31:10] Oh [20:31:18] The whole mobile config is a mess [20:31:19] labs got shut down correct? [20:31:31] But i thought you said here https://phabricator.wikimedia.org/T147971#2711488 wmf-config/mobile.php tries to modify ZeroBanner's values, and I think this is where it fails: [20:31:37] Both extensions are loaded twice [20:31:47] in CommonSettings.php and in mobile.php [20:32:46] it uses require_once [20:33:05] I'm not sure you can really call that a solution [20:33:10] hack, if anything [20:33:36] Reedy, and no, it's not loaded in commonssetting [20:33:46] Yes it is [20:33:57] are you talking about jsonconfig itself? [20:34:01] or the zerobanner? [20:34:10] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L2993-L2999 [20:34:19] JsonConfig, ZeroBanner, ZeroPortal [20:34:37] Then we have [20:34:38] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/mobile.php#L33-L35 [20:34:54] Just set $wgJsonConfigs['JsonZeroConfig']['remote'] in some extension functions callback?
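The "extension functions callback" idea in the last message can be sketched roughly as follows. This is only an illustration, not the patch that was later uploaded: the URL is a placeholder, the `isLocal` entry is a stand-in for whatever extension registration leaves in `$wgJsonConfigs`, and the `foreach` loop merely imitates MediaWiki's setup code, which (as far as I understand it) runs the `$wgExtensionFunctions` callbacks after the registration queue has been processed.

```php
<?php
// Stand-in for the state left behind by extension registration (illustrative value).
$wgJsonConfigs = [ 'JsonZeroConfig' => [ 'isLocal' => false ] ];
$wgExtensionFunctions = [];

// Defer the assignment so it happens after registration has merged its config,
// instead of setting the global at file scope where it may be clobbered later.
$wgExtensionFunctions[] = function () {
	global $wgJsonConfigs;
	$wgJsonConfigs['JsonZeroConfig']['remote'] = [
		'url' => 'https://example.org/w/api.php', // placeholder, not the real API URL
	];
};

// Imitation of MediaWiki's setup step that invokes the deferred callbacks.
foreach ( $wgExtensionFunctions as $func ) {
	$func();
}

var_dump( $wgJsonConfigs['JsonZeroConfig']['remote']['url'] );
```

The point of the deferral is ordering only; it does not address how the nested arrays are merged.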
[20:35:06] That's the simplest fix till the clusterfsck that is the mobile config is fixed [20:35:22] Reedy, in one case, it only loads if ZeroPortal is set, in the other, if it's not set [20:35:56] But by that point [20:36:00] JsonConfig is already loaded [20:36:21] is it possible for the extension to load twice? [20:36:48] The fact that the config does it, is frankly wrong [20:36:59] oh, and we load it two different ways - wfLoadExtension( 'JsonConfig' ); and require_once( "$IP/extensions/JsonConfig/JsonConfig.php" ); [20:37:09] Yeh, that will load it twice [20:37:11] Well, that's vaguely the same [20:37:21] the php entry point still calls wfLoadExtension() [20:37:31] But won't that then do it twice? [20:37:52] Won't what? [20:38:36] Load the extension twice [20:38:47] The extensionregistry queue looks like it should prevent that [20:38:50] Because it would be calling wfLoadExtension twice [20:38:56] the real question is if loading it twice would reset the `JsonConfigs` [20:39:04] paladox: I know you are really trying to be helpful, but wild speculation generally doesn't help when debugging a real problem. [20:39:05] if not, this is not the issue [20:39:05] The problem is the merging [20:39:51] As I said, the quickest fix I can see, is just set the API URL in an extension functions callback [20:39:59] thcipriani Hi, i think i figured out why the grrrit-wm bot restarts on your patches, as soon as you upload you do +2, but it seems that instead of making it a separate comment it just does it on the first. [20:40:17] that seems to be the only difference [20:40:28] Reedy, zerobanner gets loaded in both cases - for zeroportal and for all other wikis.
And config needs to be different [20:40:49] I think the loading is just crap, but I don't think it's the problem [20:41:00] They're not loaded at the point that wfLoadExtension() is called [20:41:03] They're queued to be loaded [20:41:06] Which is the problem [20:41:17] The extensions are setting globals in the global scope [20:41:24] bd808 ok sorry [20:41:33] This then presumably gets overridden by the extension actually being loaded [20:41:42] And an inappropriate merge of config in global happens [20:42:01] how do globals from mobile.php get injected into that? [20:42:16] https://github.com/wikimedia/mediawiki/blob/master/includes/registration/ExtensionRegistry.php [20:42:18] at what point do the values merge with what's in the extension.json? [20:42:32] https://github.com/wikimedia/mediawiki/blob/master/includes/registration/ExtensionRegistry.php#L235-L294 [20:42:50] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:50:12] Reedy, i'm still going through that code - it seems that the merge strategy would need to be set on every instance of the JsonConfig's usage [20:50:26] Yeah, I'm gonna do it [20:50:35] well, every usage in an extension.json file [20:50:40] right [20:50:48] There's a reason we moved it to an attribute, ala https://github.com/wikimedia/mediawiki-extensions-JsonConfig/commit/a51b97ad117ba51dde4c956358d3281dfb378455 [20:51:17] question is whether array_plus or something else is more appropriate [20:51:30] Reedy, and https://gerrit.wikimedia.org/r/#/c/315772 needs fixing - it's an object [20:51:56] yeah, jenkins shat itself over it [20:51:57] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2714485 (10Cmjohnson) [20:52:56] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - 
https://phabricator.wikimedia.org/T147933#2708958 (10Cmjohnson) For the most part these are ready for @Joe. I did not add production DNS. Fully accessible via mgmt network. [20:54:55] (03PS2) 10Hashar: openstack: skip LDAP update for contintcloud [puppet] - 10https://gerrit.wikimedia.org/r/314188 [20:54:59] Reedy, technically you only need to change the merge strategy in zerobanner [20:55:13] Shouldn't harm being there in all [20:55:29] Seems sensible for completeness, and prevention of future issues [20:55:36] (03CR) 10Hashar: "Still, I have no idea how to test this patch :( Maybe on labtest cluster?" [puppet] - 10https://gerrit.wikimedia.org/r/314188 (owner: 10Hashar) [20:56:58] true [20:57:01] merged [20:57:07] making a pull req [20:57:14] I did them all? [20:57:16] i mean - cherrypick [20:57:16] (03PS1) 10Reedy: Hack for making sure $wgJsonConfigs['JsonZeroConfig'] is set late enough [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315828 [20:57:21] heh [20:57:27] ^ That would be the other workaround [20:57:55] LOL [20:57:59] (03CR) 10jenkins-bot: [V: 04-1] Hack for making sure $wgJsonConfigs['JsonZeroConfig'] is set late enough [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315828 (owner: 10Reedy) [20:58:00] ok, i guess that should work [20:58:42] missing l [20:58:43] missing ; [20:59:23] (03PS2) 10Reedy: Hack for making sure $wgJsonConfigs['JsonZeroConfig'] is set late enough [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315828 [21:00:24] (03CR) 10BryanDavis: [C: 04-1] Hack for making sure $wgJsonConfigs['JsonZeroConfig'] is set late enough (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315828 (owner: 10Reedy) [21:00:42] lol [21:01:03] (03PS3) 10Reedy: Hack for making sure $wgJsonConfigs['JsonZeroConfig'] is set late enough [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315828 [21:01:11] Let's see if we can get away without merging that patch [21:01:32] you could do use() instead of global there too I think. 
Not sure that one is better than the other [21:01:44] Mmm [21:02:03] the global keyword always gets my attention. it's like a #FIXME comment :) [21:02:46] Like I say, see if we can get away without merging that patch [21:03:04] We should be able to see this on beta when the extension patches get merged... [21:03:16] Reedy, i reworked your patches a bit, please take a look [21:03:41] (03PS2) 10Dzahn: Gerrit: Go back to pruning logs every 7 days [puppet] - 10https://gerrit.wikimedia.org/r/315519 (owner: 10Chad) [21:03:46] WFM yurik [21:03:54] Reedy, i moved it to top level [21:04:15] Yeah, hence Works For Me :) [21:04:37] hehe [21:04:47] bd808, agree, globals suck [21:04:53] (03CR) 10Dzahn: [C: 032] Gerrit: Go back to pruning logs every 7 days [puppet] - 10https://gerrit.wikimedia.org/r/315519 (owner: 10Chad) [21:05:16] !log attempting nodejs upgrade on ocg1001 [21:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:35] yurik: :) https://tools.wmflabs.org/bash/quip/AU7VTvVV6snAnmqnK_pD [21:06:10] lolol [21:06:24] you have to be able to hear Tim saying that aloud to really appreciate it I think [21:06:33] (03CR) 10Dzahn: [C: 04-1] "it will not be needed anymore soon because see https://phabricator.wikimedia.org/T119042#2713061 specifically step 2 in there" [puppet] - 10https://gerrit.wikimedia.org/r/311194 (https://phabricator.wikimedia.org/T119042) (owner: 10Paladox) [21:06:46] bd808, https://toggl.com/programming-princess -- see last [21:07:10] i have heard tim speak plenty to imagine that :))) [21:07:38] (03CR) 10Dzahn: "@Paladox well you applied it on labs, does Gerrit override it there or no? re: "may". 
and does your +1 stand without that addition or is i" [puppet] - 10https://gerrit.wikimedia.org/r/315511 (https://phabricator.wikimedia.org/T141286) (owner: 10PleaseStand) [21:08:00] is there a task for: [Exception ErrorException] (/srv/mediawiki/php-1.28.0-wmf.22/languages/Language.php:463) PHP Fatal Error: Invalid operand type was used: cannot perform this operation with arrays ? [21:08:00] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:08:32] seems to have started on 2016-10-11 12:12:21 [21:08:53] I think there is one for that [21:09:20] yurik: I find generic PHP bashing like that comic to be pretty distasteful. Wikimedia, Facebook, Wordpress, Etsy, Baidu, Slack, ... lots of pretty successful PHP projects in the world. [21:09:38] it has quirks, but so do all languages [21:09:39] I hadn't seen a task for that one... [21:10:13] bd808, the real question is it DUE or DESPITE these quirks the projects are successful :) [21:10:17] bd808: But it's a fractal of bad design! [21:10:22] :P [21:10:26] it is a fractal of bad design [21:10:42] (03CR) 10Dzahn: [C: 031] "lgtm, but will cause gerrit restart" [puppet] - 10https://gerrit.wikimedia.org/r/315571 (owner: 10Chad) [21:10:46] Reedy, #wikimedia-interactive [21:11:25] 21:04:31 PHP Warning: array_key_exists() expects parameter 2 to be array, string given in /home/jenkins/workspace/mediawiki-extensions-qunit-jessie/src/extensions/JsonConfig/includes/JCSingleton.php on line 789 [21:11:25] (03CR) 10Chad: [C: 04-1] "Ah, typo in this anyway!" [puppet] - 10https://gerrit.wikimedia.org/r/315571 (owner: 10Chad) [21:11:25] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: connection error: HTTPConnectionPool(host=localhost, port=8000): Max retries exceeded with url: /?command=health (Caused by class socket.error: [Errno 111] Connection refused) [21:11:25] honestly the only credible knock on PHP is that the standard library of functions is inconsistent. 
The rest is all language design bikeshedding [21:13:03] lack of a foreign function interface (there's a PECL extension that was last updated in 2004) [21:13:06] (03PS2) 10Chad: Gerrit: puppetize log4j.properties [puppet] - 10https://gerrit.wikimedia.org/r/315571 [21:13:33] no other web stack I've worked with has gotten shared nothing more correct than PHP. And deterministic destructors. [21:13:49] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 720930 msg: ocg_render_job_queue 0 msg [21:14:16] there are lots of amazing Perl projects that are successful too, but it's ok to bash Perl :P [21:14:44] It's ok to bash bad Perl (and bad PHP) [21:14:57] CGIs did okay with shared-nothing [21:15:02] just saying.. [21:15:16] yes [21:15:20] * bd808 likes fcgi too [21:15:43] * bd808 is grouchy and old and likes fiddly old tech [21:15:54] !log updating nodejs on ocg1002 [21:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:48] * tom29739 likes stuff that works without large amounts of work [21:17:40] * ori likes languages that let you focus on the problem you're trying to solve rather than clutter your mind with a litany of special cases [21:17:47] so by corollary any language is actually doing just fine with shared-nothing [21:18:31] it's not a property of a language, but of an execution model for a web service that can be implemented using any number of languages [21:19:11] gwicke: as long as the interpreter doesn't have a multi-second startup time. Which fcgi tries to sweep under the rug [21:19:33] if you do fcgi, you might as well just talk http [21:19:44] it doesn't imply shared-nothing [21:20:17] !log updating nodejs on ocg1003 [21:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:20:33] Reedy: I can't find it [21:20:54] gwicke: I guess it depends on the container. 
I'm thinking more pre-forked cgi workers [21:22:13] (03CR) 10Dzahn: zuul: refactor to use hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) (owner: 10Hashar) [21:23:19] bd808: PHP was the first web stack that packaged all that standard behavior in a very accessible package [21:25:22] which is mostly mod_php's achievement over the python & perl equivalents [21:25:39] That ^ [21:25:42] agreed. and it wasn't seen as quaint or outmoded until the EJB crowd tried to convince the world that only OOP and threads could be used to solve problems [21:26:00] Setting up a Perl/CGI script was always a pain compared to PHP (for less-savvy sysadmins) [21:26:25] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS, 13Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2714531 (10BBlack) [21:26:28] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714528 (10BBlack) 05Open>03Resolved a:03BBlack Resolved for now. To recap: Initial symptom was lots of errors the ocg logs after we deployed the... [21:27:23] bd808: but hey, it was enterprise class! [21:27:54] convert mediawiki to Perl! [21:28:15] Python! [21:28:21] I've been trying to get the old Perl software I worked on to install. [21:28:26] Just for shits. [21:28:28] It's been a pain [21:28:44] asp.net! 
then sharepoint wiki has some real competition >.> [21:29:04] Wikipedia used to run UseModWiki, which was written in Perl [21:29:21] * gwicke used to hack on http://ispman.sourceforge.net/ [21:29:31] If we still used UseMod, maybe Ævar would still be around :) [21:30:25] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS, 13Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2714536 (10BBlack) @Nuria - it would have to be specifically for MacOS Sierra (the new version that came out less than a month ago). There were othe... [21:30:45] Ævar was still around when we used phase3 [21:31:03] I know I'm just making Perl jokes :) [21:31:21] ori: CamelCase4Lyf [21:31:39] ostriches: kk, gotcha ;) [21:32:07] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714537 (10hashar) That is a very nice fix and summary. Thank you! [21:33:17] (03CR) 10Dzahn: "I think there is a big misunderstanding about what this is. 
This is not an attempt to change _anything_ inside the mariadb module or roles" [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [21:35:33] mutante: Only change in the log4j stuff: https://gerrit.wikimedia.org/r/#/c/315571/1..2/modules/gerrit/templates/gerrit.config.erb [21:35:49] (03CR) 10Dzahn: "finally, it's not against Faidon's suggestion, it's making it possible to follow it and move all the other classes at once as Yuvipanda po" [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [21:38:13] 06Operations, 06Analytics-Kanban, 10Traffic, 07HTTPS: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714544 (10Volans) FYI it's worth noticing that the upgrade of NodeJS for this service looks a bit broken by design to me, given that `apt-get` will over... [21:38:45] yurik: Do you know if anyone is working on fixing the ZeroBanner config bug (https://phabricator.wikimedia.org/T147971)? [21:39:00] kaldari, not to my knowledge [21:39:11] ostriches: that looks correct yep, do we wanna restart it now though? [21:39:30] yurik: what? not you? why not? [21:39:35] yurik: it seems to be blocking the deployment train (and there isn't going to be a deployment train next week) [21:39:41] I thought you and Reedy were talking about that previously [21:39:42] jouncebot: next [21:39:42] In 1 hour(s) and 20 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161013T2300) [21:39:56] mutante: Now's as good as ever I guess :) [21:40:12] oh, oops, greg-g kaldari, sorry, yes, we are working on it. 
I thought it was a different bug [21:40:14] (03CR) 10Dzahn: [C: 032] Gerrit: puppetize log4j.properties [puppet] - 10https://gerrit.wikimedia.org/r/315571 (owner: 10Chad) [21:40:20] I was about to get irate [21:40:24] yurik: lol, cool [21:40:27] :) [21:40:32] Reedy just merged a patch, testing in a bit [21:40:37] yay [21:40:55] !log gerrit is restarting for config change 315571 [21:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:04] greg-g, it is so easy to get you "irate"! I thought cali people were much more ... calm :-P [21:41:21] yurik: some people do a better job at pissing me off, yes. [21:41:30] \o/ [21:41:37] * yurik is proud of himself :-P [21:42:04] not really a winning plan [21:42:05] !log gerrit is back [21:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:11] I'm greg-g gets irate when needed :) [21:42:15] oops [21:42:25] :) [21:42:28] I mean... I'm glad greg-g gets irate when needed [21:42:45] I'm an irateness as a service [21:42:51] IaaS! [21:43:04] :) [21:43:14] IANAS [21:43:25] nice [21:47:59] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:48:57] what, it quits now? [21:49:05] mutante im restarting it [21:49:29] paladox: hah, thanks, i literally just logged in . would have pinged you but you were marked away [21:49:47] Oh [21:49:56] mutante im not sure why it marked me as away [21:49:59] I was still here [21:50:12] (03CR) 10Madhuvishy: [C: 032] nfs: Fix params passed to drbd monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/315836 (owner: 10Madhuvishy) [21:50:16] paladox: 14:39 -!- paladox is away: Auto away at Thu Oct 13 22:08:56 2016 [21:50:22] Oh [21:50:31] LOL it had my time [21:50:33] did you get my question in PM? 
[21:50:39] Nope [21:50:43] Oh [21:50:44] wait [21:50:44] yes [21:52:02] (03PS1) 10Chad: Move Aff|LegalContactPages to MetaContactPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315838 [21:54:19] (03PS3) 10PleaseStand: gerrit: Fix CSS selector for diff font size override [puppet] - 10https://gerrit.wikimedia.org/r/315511 (https://phabricator.wikimedia.org/T141286) [21:54:24] (03PS4) 10Paladox: gerrit: Fix CSS selector for diff font size override [puppet] - 10https://gerrit.wikimedia.org/r/315511 (https://phabricator.wikimedia.org/T141286) (owner: 10PleaseStand) [21:54:33] !log reedy@mira Synchronized wmf-config/mobile.php: Load wgJsonConfigs in callback (duration: 00m 56s) [21:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:54:40] (03CR) 10Paladox: [C: 031] gerrit: Fix CSS selector for diff font size override [puppet] - 10https://gerrit.wikimedia.org/r/315511 (https://phabricator.wikimedia.org/T141286) (owner: 10PleaseStand) [21:54:42] (03PS1) 10Reedy: Revert "Revert "all wikis to 1.28.0-wmf.22"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315839 [21:54:45] (03PS2) 10Reedy: Revert "Revert "all wikis to 1.28.0-wmf.22"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315839 [21:55:02] (03PS3) 10Reedy: all wikis to 1.28.0-wmf.22 take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315839 [21:56:24] jerkns seems it's fine [21:56:54] yeah jenkins broke just now [21:57:01] had to re-approve a few patches [21:57:13] (03CR) 10Reedy: [C: 032] all wikis to 1.28.0-wmf.22 take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315839 (owner: 10Reedy) [21:57:43] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.22 take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315839 (owner: 10Reedy) [21:58:15] Notice: Undefined property: MobilePage::$revisionTimestamp in /srv/mediawiki/php-1.28.0-wmf.22/extensions/MobileFrontend/includes/models/MobilePage.php on line 66 [21:58:54] Matanya 
reported that in Phab earlier [21:59:16] that one seems to spike and go back down. should be a task. [21:59:20] ah, ok [21:59:24] https://phabricator.wikimedia.org/T147993 [21:59:34] ok, so let's .22 again [21:59:34] (03CR) 10Dzahn: [C: 032] gerrit: Fix CSS selector for diff font size override [puppet] - 10https://gerrit.wikimedia.org/r/315511 (https://phabricator.wikimedia.org/T141286) (owner: 10PleaseStand) [21:59:53] running the train Reedy/ [21:59:53] ? [21:59:57] (03Abandoned) 10Chad: WIP: Bring over php::ini from MediaWiki vagrant [puppet] - 10https://gerrit.wikimedia.org/r/301285 (owner: 10Chad) [22:00:02] re-running, after I was involved in breaking :) [22:00:12] heh [22:01:04] !log reedy@mira rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.22 take 2 [22:01:09] https://www.goodreads.com/book/show/104440.Hopping_Freight_Trains_In_America [22:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:54] It seems to be spamming that error still [22:03:53] but it wasn't on mw1017...that's weird. [22:04:10] I did deploy it, didn't I? [22:04:31] Though, it looks to spike, and drop off again [22:05:44] Yeah, there's too many of them [22:06:08] you did deploy it, spot-checked on a few machines. [22:06:17] it == wmf-config/mobile.php [22:07:07] And it's right in eval.php [22:08:14] Oh fucks sake [22:08:17] fuc + [22:08:24] fuck + and everything it does to arrays [22:09:35] https://github.com/wikimedia/mediawiki-extensions-JsonConfig/commit/bfe6feec850e21c51a579b04c5a574c69354b54f [22:09:37] bad legoktm [22:09:53] Everywhere he's put a + needs to be an array merge... [22:09:55] or recursive? [22:09:59] RECOVERY - Check status of DRBD node on labstore1005 is OK: NRPE: Unable to read output [22:10:19] Yup [22:10:23] array_merge [22:10:30] thcipriani: I'm gonna make a fix to the extension, and gonna deploy it [22:10:47] kk [22:11:10] I really hate [] + []. 
I never remember what it is supposed to do [22:11:18] neither does lego :) [22:11:29] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [22:11:35] At this point [22:11:43] bd808: in JS it does empty string ;) [22:11:43] I suspect my hack to commonsettings isn't actually needed [22:13:33] yurik: What format is JsonConfigModels? [22:13:37] (03Abandoned) 10Paladox: archiva: Fix it not being a autoload module [puppet] - 10https://gerrit.wikimedia.org/r/311194 (https://phabricator.wikimedia.org/T119042) (owner: 10Paladox) [22:13:38] Should that be array_merge too? [22:13:56] volans, in PHP, the left hand array's keys overwrite the right hand array's keys [22:14:32] Reedy, it's a key-value, and in theory it could be merged recursively. For example, ZeroPortal adds an extra key to one of the values [22:14:34] Krenair: at least you got an array :D [22:14:43] php > print_r( ['b'] + ['c'] ); [22:14:43] Array ( [0] => b ) [22:17:03] bd808: I think we file this one under "PHP sucks" [22:17:19] +1 [22:17:30] load gun, aim at foot, pull trigger, profit?
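For readers who, like bd808, never remember what `[] + []` does, here is a small sketch contrasting the three operations mentioned above. The array contents are made up for illustration: only the `JsonZeroConfig`, `remote` and `url` key names come from the log, `isLocal` is a stand-in key, and the URL is a placeholder.

```php
<?php
// Illustrative values only; not the real extension or site configuration.
$fromExtension = [ 'JsonZeroConfig' => [ 'isLocal' => false ] ];
$fromSiteConfig = [ 'JsonZeroConfig' => [ 'remote' => [ 'url' => 'https://example.org/w/api.php' ] ] ];

// Union (+): when a key exists on both sides, the left-hand value is kept whole,
// so the right-hand 'remote' settings are silently dropped.
$union = $fromExtension + $fromSiteConfig;

// array_merge(): for duplicate string keys the right-hand value replaces the
// left-hand one, so this time 'isLocal' is lost instead.
$merged = array_merge( $fromExtension, $fromSiteConfig );

// array_merge_recursive(): nested arrays under the same string key are combined.
$recursive = array_merge_recursive( $fromExtension, $fromSiteConfig );

// Caveat: duplicate scalar values under the same string key are collected into a list.
$quirk = array_merge_recursive( [ 'a' => 'x' ], [ 'a' => 'y' ] );

var_dump( isset( $union['JsonZeroConfig']['remote'] ) );            // bool(false)
var_dump( isset( $merged['JsonZeroConfig']['isLocal'] ) );          // bool(false)
var_dump( isset( $recursive['JsonZeroConfig']['isLocal'] ) );       // bool(true)
var_dump( isset( $recursive['JsonZeroConfig']['remote']['url'] ) ); // bool(true)
var_dump( $quirk['a'] === [ 'x', 'y' ] );                           // bool(true)
```

Only the recursive variant keeps both nested settings here, but the last line shows why it can also change the type of a value when both sides define the same scalar key.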
[22:17:52] * Reedy waits for jenkins [22:18:51] c'mooon you stupid bot [22:20:36] Reedy, don't insult it or it will wreck its revenge upon thee [22:22:36] brb, need to get a phone charger while jenkins merging the cherry pick [22:24:35] (03PS1) 10Chad: Gerrit: Have log4j re-read its configuration every 60 seconds [puppet] - 10https://gerrit.wikimedia.org/r/315846 [22:25:16] #til about log4j monitorInterval [22:28:29] (03PS4) 1020after4: `scap patch` tool for applying patches to a wmf/branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (https://phabricator.wikimedia.org/T118478) [22:28:31] https://phabricator.wikimedia.org/P4217 [22:28:33] screw you php [22:29:56] wat [22:30:12] !log reedy@mira Synchronized php-1.28.0-wmf.22/extensions/JsonConfig/: less array + array more array_merge (duration: 00m 57s) [22:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:30:21] thcipriani: That paste is essentially what the problem is/was [22:30:30] So, now that's live... We should see the errors dropping off [22:30:32] /warnings [22:31:22] Hang on [22:31:28] That's not gonna fix it all, is it? [22:32:13] Well, it'll fix that error [22:32:16] Not the wanted config [22:32:17] We need array_merge_recursive [22:32:43] ffs [22:32:52] array_merge_forreal() [22:33:04] array_merge_do_what_a_human_expects [22:33:09] hrm, well the bars in fatal monitor are getting shorter [22:33:29] array_do_the_impossible() [22:33:37] thcipriani: it's the error solstice [22:33:39] (03CR) 10Dzahn: [C: 032] Gerrit: Have log4j re-read its configuration every 60 seconds [puppet] - 10https://gerrit.wikimedia.org/r/315846 (owner: 10Chad) [22:33:55] thcipriani: the fatals are going away, as we the debugging in the code is alright now [22:33:59] getting closer anyway. [22:34:00] ori: hah [22:34:00] the resultant config doesn't appear though [22:34:35] mutante: whee thx. 
That'll make life easier :) [22:34:40] especially as I experiment with logstash [22:35:49] yes, avoiding restart = good :) [22:35:57] even checked the log4j option, yep [22:36:07] This is compounded by the zero/jsonconfig config [22:36:08] array(1) { [22:36:08] ["JsonZeroConfig"]=> [22:36:08] array(2) { [22:36:09] etc [22:44:11] (03PS1) 10Chad: Gerrit: Use -name "*.gz" in log rotation cron [puppet] - 10https://gerrit.wikimedia.org/r/315849 [22:44:52] (03CR) 10Paladox: [C: 031] Gerrit: Use -name "*.gz" in log rotation cron [puppet] - 10https://gerrit.wikimedia.org/r/315849 (owner: 10Chad) [22:45:39] (03CR) 10Dzahn: [C: 032] "yep, works" [puppet] - 10https://gerrit.wikimedia.org/r/315849 (owner: 10Chad) [22:45:42] (03CR) 10Chad: "demon@cobalt:/var/lib/gerrit2/review_site/logs$ find /var/lib/gerrit2/review_site/logs/ -name "*.gz" -mtime +7 -exec {} \\;" [puppet] - 10https://gerrit.wikimedia.org/r/315849 (owner: 10Chad) [22:46:27] mutante: Maybe just pipe to /dev/null? [22:46:46] I mean, it's not end of the world if it fails, especially a false positive. [22:46:58] !log reedy@mira Synchronized php-1.28.0-wmf.22/extensions/JsonConfig/: array_merge_recursive (duration: 00m 50s) [22:47:00] ostriches: it's silent like this, i tested it [22:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:19] (03CR) 10Paladox: "@Chad that's because you did \\" [puppet] - 10https://gerrit.wikimedia.org/r/315849 (owner: 10Chad) [22:47:33] Oh herp derp, double escape fail [22:47:35] Yeah that'll work [22:48:05] well, on lead :p [22:48:25] why not -delete? [22:48:32] uhhh, logstash array_merge_recursive made things angry. 
[22:48:42] to avoid the ugly exec mutante :) [22:49:08] (03Draft1) 10Paladox: Gerrit: Fix double escaping [puppet] - 10https://gerrit.wikimedia.org/r/315853 [22:49:10] (03Draft2) 10Paladox: Gerrit: Fix double escaping [puppet] - 10https://gerrit.wikimedia.org/r/315853 [22:49:17] mutante ostriches ^^ [22:49:17] paladox: No, it's right in puppet [22:49:20] Was wrong on my end [22:49:25] oh [22:49:39] (03PS1) 10Reedy: Revert "Hack for making sure $wgJsonConfigs['JsonZeroConfig'] is set late enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315854 [22:49:43] (03PS2) 10Reedy: Revert "Hack for making sure $wgJsonConfigs['JsonZeroConfig'] is set late enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315854 [22:49:46] (03CR) 10Reedy: [C: 032] Revert "Hack for making sure $wgJsonConfigs['JsonZeroConfig'] is set late enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315854 (owner: 10Reedy) [22:49:50] But it shows as only needing one \ here http://stackoverflow.com/questions/2961673/find-missing-argument-to-exec [22:49:50] volans: #til about `find -delete` [22:49:59] Good to know I was right the hack isn't needed ;) [22:50:19] (03Merged) 10jenkins-bot: Revert "Hack for making sure $wgJsonConfigs['JsonZeroConfig'] is set late enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315854 (owner: 10Reedy) [22:50:30] -delete = better performance rm = more portable [22:50:35] either works for me [22:50:39] (03PS1) 10Chad: Gerrit: Use -delete in log rotation find [puppet] - 10https://gerrit.wikimedia.org/r/315855 [22:51:23] mutante: true, old ones might not have the -delete, but given our OSes I think all of them have it [22:51:37] ack, yes [22:51:37] Well, the gerrit role requires precise or newer :p [22:51:44] I think we're ok ;-) [22:51:57] s/precise/jessie/ [22:52:19] (03CR) 10Paladox: [C: 031] Gerrit: Use -delete in log rotation find [puppet] - 10https://gerrit.wikimedia.org/r/315855 (owner: 10Chad) [22:52:25] (03CR) 10Dzahn: [C: 032] 
Gerrit: Use -delete in log rotation find [puppet] - 10https://gerrit.wikimedia.org/r/315855 (owner: 10Chad) [22:53:24] to be super picky that cron block probably should have a require on '/var/lib/gerrit2/review_site/logs/' :) [22:53:36] !log reedy@mira Synchronized wmf-config/mobile.php: Revert my hack (duration: 00m 49s) [22:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:53:54] and with this I think I'm heading off to bed for today :) [22:54:36] volans: I guess technically yes, those crons are kind of an anti-pattern to an otherwise well-puppetized gerrit [22:56:52] has "pl proc line: 2959: warning: points must have either 4 or 2 values per line" been filed? [22:57:01] yeah [22:57:08] (03CR) 10Dzahn: [C: 032] "no-op in prod, labs-only http://puppet-compiler.wmflabs.org/4348/" [puppet] - 10https://gerrit.wikimedia.org/r/313937 (https://phabricator.wikimedia.org/T112765) (owner: 1020after4) [22:57:17] it's buried a bit in the phab interface, but it's there [22:57:18] (03PS7) 10Dzahn: phabricator: Configuration for Aphlict [puppet] - 10https://gerrit.wikimedia.org/r/313937 (https://phabricator.wikimedia.org/T112765) (owner: 1020after4) [22:58:30] Reedy: so we're getting ready to run into evening SWAT. Might be better to revert to wmf.21. [22:58:43] thcipriani: We're good tbh [22:59:25] :( [22:59:30] wait [22:59:35] what, how [22:59:37] this doesn't make sense [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161013T2300). Please do the needful. [23:00:04] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:09] hang on [23:00:12] that's a different error [23:00:20] Warning: JsonConfig: In $wgJsonConfigs['JsonZeroConfig']['remote'] is set for the config that will be stored on this wiki. 'remote' parameter will be ignored. [23:00:24] uno momenton on SwAT [23:00:27] I'm fine with delaying my SWAT patch deployment for an hour or two [23:00:28] momento* [23:01:01] this error started spiking after the array_merge_recursive patch [23:01:30] Message earlier is [23:01:31] Invalid $wgJsonConfigs['JsonZeroConfig']['remote']['url']: API URL is not set, and this config is not being stored locall [23:01:40] 06Operations, 10Phabricator, 10Traffic, 13Patch-For-Review: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2714910 (10Dzahn) merged per prototype/"labs-only" no-op in prod http://puppet-compiler.wmflabs.org/4348/ [23:02:00] right [23:02:25] Is this one coming from zerowiki? [23:02:28] Where this shouldn't be set? [23:02:38] This can't be a new error... Or it's just been masked for a long time [23:03:04] This is what I was saying earlier about this whole config being a mess [23:03:40] There's also [23:03:40] JsonConfig: Invalid type of one of the parameters in $wgJsonConfigs['JsonZeroConfig'], please check documentation [23:03:46] Which is unhelpful. Which one? [23:04:58] 06Operations, 13Patch-For-Review: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#2714922 (10Dzahn) We talked about this briefly and agreed that "germanium" is free and can be used to avoid reusing a name. The VMs for PCI scanning _might_ be re-created one day and reusing nam...
[23:06:29] (03PS1) 10Reedy: Only set remote config if wiki isn't Zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315859 [23:06:41] (03PS2) 10Reedy: Only set remote config if wiki isn't Zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315859 [23:06:44] (03CR) 10Reedy: [C: 032] Only set remote config if wiki isn't Zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315859 (owner: 10Reedy) [23:07:12] (03Merged) 10jenkins-bot: Only set remote config if wiki isn't Zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315859 (owner: 10Reedy) [23:08:53] !log reedy@mira Synchronized wmf-config/mobile.php: Only set remote config if not zerowiki (duration: 01m 15s) [23:08:56] So, if this doesn't fix it [23:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:01] I'm gonna disable the extension [23:09:02] ;) [23:11:24] hrm. Error rate seems about the same. [23:12:43] It's definitely not there in eval.php for zerowiki now [23:13:30] How do we tell which db this is on? [23:14:11] for hhvm errors I'm not clear if there is a way. 
[23:14:18] bd808: ^ [23:15:09] ["isLocal"]=> [23:15:09] bool(false) [23:15:15] That can't be right on zero [23:17:19] thcipriani: I think we should revert at this point [23:17:29] Reedy: agreed [23:17:43] I'm going down rabbitholes of the actual config of the Zero extensions [23:17:47] Which just seem to be completely wrong [23:18:14] (03PS1) 10Reedy: Revert "all wikis to 1.28.0-wmf.22 take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315860 [23:18:18] (03PS2) 10Reedy: Revert "all wikis to 1.28.0-wmf.22 take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315860 [23:18:22] (03CR) 10Reedy: [C: 032] Revert "all wikis to 1.28.0-wmf.22 take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315860 (owner: 10Reedy) [23:18:26] whatever is happening is definitely non-obvious, can't fix in prod :( [23:18:42] Yeah [23:18:46] This really doesn't make sense [23:18:52] (03Merged) 10jenkins-bot: Revert "all wikis to 1.28.0-wmf.22 take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315860 (owner: 10Reedy) [23:19:00] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:19:01] when this has merged, I'll have a look at a couple of things with the config [23:19:31] !log reedy@mira rebuilt wikiversions.php and synchronized wikiversions files: all wikis back to .21 [23:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:20] Oh ffs [23:20:24] That hasn't reverted zerowiki [23:21:11] Has it fixed the errors? [23:21:48] (number of "ffs" in -operations is getting close to "Reedy conducting the train" levels) [23:22:04] cut the errors down by a lot, but they're still there for sure. [23:22:38] error rate is like a 1/4 of what it was [23:23:41] "ffs" in -operations - 11.11% of data above the critical threshold [23:29:19] Well, are we abandoning .22 completely?
[23:29:25] No [23:29:36] Then what's the plan [23:29:42] Party [23:29:47] We revert ZeroBanner and ZeroPortal if we want to continue with .22 [23:30:17] is this all just due to extension.json conversion? [23:30:23] Reedy, that bad? [23:30:42] yurik: Well, I fixed it up so the config was right, then zero was getting config it shouldn't have [23:30:45] Creating different warnings [23:30:51] I'm kinda confused how it works now [23:31:25] Reedy, all wikis should get the same config, except zeroportal [23:31:31] Right [23:31:37] zeroportal will have a different one that it sets itself [23:31:48] But even when I guarded it, the error was still there [23:31:59] can you point me to the error [23:32:05] And due to hhvm, we can't see what wiki it was on [23:32:20] sigh [23:32:27] Warning: JsonConfig: In $wgJsonConfigs['JsonZeroConfig']['remote'] is set for the config that will be stored on this wiki. 'remote' parameter will be ignored. [23:32:55] quick question: who's planning on doing the SWAT deployment after the train? [23:33:11] Reedy, that's fine [23:33:20] yurik: But it was spamming like nobody's business [23:33:25] You should disable the warning then [23:33:40] can we revert the past few patches to jsonconfig? error rate is still much higher than it was when we started down this path. [23:33:41] Reedy, that it shouldn't do, because the only wiki that should have that is zeroportal, which no one uses [23:34:13] if it's spamming like crazy, it means either zeroportal ext got enabled somewhere where it shouldn't, or something else is wrong [23:34:13] yurik: as Reedy said, spamming the logs is not fine. [23:34:21] kaldari: No one claimed it... [23:34:33] yurik: But reverting the extension.json changes shouldn't fix that [23:34:35] That just doesn't make sense [23:34:40] It seems the config is just fubar [23:35:25] Reedy, have the configs changed recently?
I say we should revert zerobanner and zeroportal changes and revert config changes [23:35:33] clearly the upgrade to extension.json is not ready then [23:35:36] I don't think it has [23:35:46] I don't see how those changes can break it [23:35:52] neither can i [23:35:58] but if nothing else changed, it must be it [23:36:20] It must be, but it doesn't really make sense [23:36:31] or the interpretation of the config is broken in ZB [23:36:34] I think we should just reset at this point and go back to the drawing board to try to solve this issue. [23:36:47] yes, no more effing with prod now [23:36:54] Yeah [23:37:01] revert the hell out of this and get us in a sane place for 10 days [23:37:22] hehe [23:37:23] agree [23:37:44] and i really wish there was a sane way to test it :( [23:37:52] Reedy: could you make reverts for those JsonConfig changes? [23:38:08] I just made reverts for Zero extensions [23:38:11] thcipriani, these are not jsonconfig :) [23:38:14] zero only [23:38:36] JC has been working fine for several months with the new extension.json stuff [23:39:10] I'll just merge the reverts on the .22 branches [23:39:20] Makes it a bit easier for people to poke further on master [23:39:27] ah, these two cancel each other out seemingly https://github.com/wikimedia/mediawiki-extensions-JsonConfig/commits/master [23:39:44] sorry, lots of changes :) [23:40:22] https://gerrit.wikimedia.org/r/#/c/315859/ should maybe be reverted.
But it shouldn't be hurting anything (it's still live now) [23:41:58] (03PS1) 10Dzahn: netboot/LVS: fix partman recipes used by LVS hosts [puppet] - 10https://gerrit.wikimedia.org/r/315869 (https://phabricator.wikimedia.org/T136737) [23:42:29] 23:41:24 1) JsonConfig\Tests\JCTitleParsingTest::testTitleParsing with data set #0 (false, null, false) [23:42:30] 23:41:24 JsonConfig: Invalid $wgJsonConfigs['JsonZeroConfig']['remote']['url']: API URL is not set, and this config is not being stored locally [Called from JsonConfig\JCSingleton::parseConfiguration in /srv/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/JsonConfig/includes/JCSingleton.php at line 147] [23:42:36] Fuck. Right. Off [23:42:46] (03PS3) 10Rush: WIP: bdsync backup setup for labstore [puppet] - 10https://gerrit.wikimedia.org/r/315595 [23:42:54] This just points out that it's hardly the extension registration changes [23:43:34] /the whole thing is a mess [23:43:51] (03CR) 10jenkins-bot: [V: 04-1] WIP: bdsync backup setup for labstore [puppet] - 10https://gerrit.wikimedia.org/r/315595 (owner: 10Rush) [23:44:13] Reedy, yeah, it means there is no remote config for all wikis. Reedy, how about a quick hangout later today or tomorrow, where we can go over it - your knowledge of the extension config + mine of zero should help solve this [23:44:27] It's nearly 1am for me :) [23:44:35] tomorrow then :) [23:44:42] the error rate is way higher than when we started troubleshooting, so I'd say that whatever merged and was deployed to troubleshoot this issue ought to be reverted. 
[23:45:15] thcipriani: https://gerrit.wikimedia.org/r/#/c/315859/ is the only live change that hasn't been reverted [23:45:34] That takes us back to a level field [23:45:43] (03PS1) 10Reedy: Revert "Only set remote config if wiki isn't Zero" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315870 [23:45:49] (03PS2) 10Reedy: Revert "Only set remote config if wiki isn't Zero" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315870 [23:45:52] (03CR) 10Reedy: [C: 032] Revert "Only set remote config if wiki isn't Zero" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315870 (owner: 10Reedy) [23:45:54] (03PS1) 10Thcipriani: Revert "Only set remote config if wiki isn't Zero" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315871 [23:46:06] ah, quicker on the draw. [23:46:07] heh [23:46:19] thcipriani: please ping me when the train is finished rolling :) [23:46:19] I don't understand how that could be causing the difference in error rate I'm seeing. [23:46:20] (03Merged) 10jenkins-bot: Revert "Only set remote config if wiki isn't Zero" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315870 (owner: 10Reedy) [23:46:24] kaldari: will do [23:47:40] thcipriani, higher traffic rates now? [23:47:45] general traffic? 
[23:47:57] !log reedy@mira Synchronized wmf-config/mobile.php: Back to pre deploy state (duration: 00m 49s) [23:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:09] would have to be ~1,000x higher [23:49:12] https://logstash.wikimedia.org/goto/f835c94eb93494ba43ea22a186c275da [23:49:36] !log reedy@mira Synchronized php-1.28.0-wmf.22/extensions/ZeroBanner: Revert extenson registration (duration: 00m 50s) [23:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:50:38] !log reedy@mira Synchronized php-1.28.0-wmf.22/extensions/ZeroPortal: Revert extenson registration (duration: 00m 49s) [23:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:50:53] thcipriani, it seems like it's dropping [23:51:05] yup, those last 2 syncs I think did it [23:51:22] magic :( [23:51:34] thanks Reedy :) [23:51:37] That doesn't make sense though [23:51:43] magical Reedy :) [23:51:44] ZeroBanner is only on wikipedias [23:51:47] Who are on .21 [23:51:55] yeah, it totally does not make any sense [23:52:16] some js cached somewhere?
[23:52:48] (03PS1) 10Dzahn: repeat hostnames for IPv6 for installservers [dns] - 10https://gerrit.wikimedia.org/r/315872 [23:53:01] for zero clients, the HTML is different - it loads JS snippet to draw the banner [23:53:18] but still i cannot think of how it could cause all these errors [23:53:48] (03PS1) 10Reedy: Revert "Revert "all wikis to 1.28.0-wmf.22 take 2"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315873 [23:53:53] (03PS2) 10Reedy: Revert "Revert "all wikis to 1.28.0-wmf.22 take 2"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315873 [23:54:00] (03CR) 10Dzahn: [C: 032] repeat hostnames for IPv6 for installservers [dns] - 10https://gerrit.wikimedia.org/r/315872 (owner: 10Dzahn) [23:54:04] (03PS3) 10Reedy: all wikis to 1.28.0-wmf.22 take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315873 [23:54:43] Do we give it a try, again? [23:55:16] what would be different this time? [23:55:29] No extension registration in either Banner or Portal [23:55:32] Oh [23:55:39] I know why the error rate will have dropped at my last syncs [23:55:47] Cause ZeroWiki is still on .22, isn't it? [23:55:56] yeah, group0 [23:56:07] Ok, so that error drop isn't so magic [23:56:08] xD [23:56:48] it should be moved to group2 i think - no one pays much attention to it [23:57:24] (03CR) 10Reedy: [C: 032] all wikis to 1.28.0-wmf.22 take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315873 (owner: 10Reedy) [23:57:27] hmm, actually no, because some wp are group1 [23:57:33] i think ZP should also be group1 then [23:57:51] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.22 take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315873 (owner: 10Reedy) [23:58:58] it's 5pm on Thursday. There's not much more time for experimentation. As in, none.
[23:58:58] !log reedy@mira rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.22 take 3 [23:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master