[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170215T0000). [00:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:01:21] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:01:27] \o [00:01:45] i suppose i can ship my own patches [00:06:08] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 10Scap, 15User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3027833 (10thcipriani) 05Open>03Resolved a:05mmodell>03thcipriani What's... [00:07:43] !log ebernhardson@tin Synchronized php-1.29.0-wmf.11/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: (no justification provided) (duration: 00m 50s) [00:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:58] (03CR) 10EBernhardson: [C: 032] Configure cirrus per-index settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336936 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [00:08:05] (03CR) 10jerkins-bot: [V: 04-1] Configure cirrus per-index settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336936 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [00:08:26] rebasing ... 
[00:10:04] (03PS3) 10EBernhardson: Configure cirrus per-index settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336936 (https://phabricator.wikimedia.org/T155578) [00:13:34] (03CR) 10EBernhardson: [C: 032] Configure cirrus per-index settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336936 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [00:15:00] (03Merged) 10jenkins-bot: Configure cirrus per-index settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336936 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [00:15:55] (03CR) 10MaxSem: [C: 031] "We probably need to remove exceptions from https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/text-f" [dns] - 10https://gerrit.wikimedia.org/r/337522 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn) [00:16:10] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: Configure cirrus per-index setings for elasticsearch 5 (duration: 00m 43s) [00:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:51] (03CR) 10jenkins-bot: Configure cirrus per-index settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336936 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [00:22:51] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:22:54] !log ebernhardson@tin Synchronized php-1.29.0-wmf.11/extensions/CirrusSearch/: Provide per-index settings from configuration for elasticsearch 5 (duration: 00m 55s) [00:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:51] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [00:30:21] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [00:32:16] (03PS1) 10MaxSem: Enable mobile redirection for all wikimanias [puppet] - 10https://gerrit.wikimedia.org/r/337767 [00:33:01] (03CR) 10MaxSem: "Adding Jon for Reading Web review." [puppet] - 10https://gerrit.wikimedia.org/r/337767 (owner: 10MaxSem) [00:56:18] (03CR) 10Faidon Liambotis: [C: 04-1] "This is fine for now and I'd be OK with merging it, but at some point we'll have to move the NTP servers to stretch and then this will bre" [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff) [00:57:44] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [00:58:14] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:09:14] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [01:09:44] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [01:12:44] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:41:44] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [02:03:28] (03PS1) 10Madhuvishy: diamond: Allow providing puppet file reference to collector config file [puppet] - 10https://gerrit.wikimedia.org/r/337769 [02:35:08] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.11) (duration: 12m 50s) [02:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:59] (03PS2) 10Madhuvishy: diamond: Allow providing puppet file reference to collector config file [puppet] - 10https://gerrit.wikimedia.org/r/337769 [02:40:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Feb 15 02:40:31 UTC 2017 (duration 5m 23s) [02:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:17] (03PS2) 10Brion VIBBER: Bump up number of queue runners for transcodes [puppet] - 10https://gerrit.wikimedia.org/r/337230 (https://phabricator.wikimedia.org/T108234) [02:55:59] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3028263 (10bd808) [03:02:14] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:02:44] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [03:08:44] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [03:09:14] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [03:31:54] PROBLEM - puppet last run on mc1024 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:59:54] RECOVERY - puppet last run on mc1024 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [04:03:04] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:03:43] (03PS1) 10JustBerry: Adding ICU libicu52 and python wrapper PyICU package for k8s to Dockerfile. Installs ICU (International Components for Unicode) library libicu52 (dependency) and PyICU (python wrapper). See https://packages.debian.org/jessie/python-pyicu for package infor [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337770 [04:03:54] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.155 second response time [04:06:35] (03Abandoned) 10JustBerry: Adding ICU libicu52 and python wrapper PyICU package for k8s to Dockerfile. Installs ICU (International Components for Unicode) library libicu52 (dependency) and PyICU (python wrapper). 
See https://packages.debian.org/jessie/python-pyicu for package infor [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337770 (owner: 10JustBerry) [04:06:55] yuvipanda: ditched ^^ [04:18:14] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1802.562396 Seconds [04:19:14] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [04:30:54] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.177 second response time [04:32:04] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [04:46:28] (03PS1) 10Yuvipanda: tools: Make DNS point to labsdb1004 and not 1005 [puppet] - 10https://gerrit.wikimedia.org/r/337775 (https://phabricator.wikimedia.org/T123731) [04:46:45] (03CR) 10jerkins-bot: [V: 04-1] tools: Make DNS point to labsdb1004 and not 1005 [puppet] - 10https://gerrit.wikimedia.org/r/337775 (https://phabricator.wikimedia.org/T123731) (owner: 10Yuvipanda) [04:48:06] (03PS3) 10Yuvipanda: tools: Upgrade docker on tools k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/336573 (https://phabricator.wikimedia.org/T157180) [04:48:08] (03PS2) 10Yuvipanda: tools: Make DNS point to labsdb1004 and not 1005 [puppet] - 10https://gerrit.wikimedia.org/r/337775 (https://phabricator.wikimedia.org/T123731) [04:49:11] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Upgrade docker on tools k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/336573 (https://phabricator.wikimedia.org/T157180) (owner: 10Yuvipanda) [05:11:41] (03PS1) 10Yuvipanda: tools: Enable cronjobs for tools k8s [puppet] - 10https://gerrit.wikimedia.org/r/337776 (https://phabricator.wikimedia.org/T158155) [05:21:23] (03CR) 10BryanDavis: [C: 031] tools: Enable cronjobs for tools k8s [puppet] - 10https://gerrit.wikimedia.org/r/337776 (https://phabricator.wikimedia.org/T158155) (owner: 10Yuvipanda) 
[05:59:34] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=308.50 Read Requests/Sec=647.20 Write Requests/Sec=3.20 KBytes Read/Sec=31854.40 KBytes_Written/Sec=1277.20 [06:11:34] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=13.70 Read Requests/Sec=0.00 Write Requests/Sec=1.10 KBytes Read/Sec=0.00 KBytes_Written/Sec=33.20 [06:37:44] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:37:44] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:04] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:04] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:04] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:14] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:14] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:14] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:38:34] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:38:34] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:38:54] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [06:38:54] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:38:54] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [06:39:04] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:39:04] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:39:04] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:19:12] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 07User-notice: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3002516 (10Marostegui) ``` 04:23 < yuvipanda> marostegui: jynus I can verify that I can access labsdb1004 from tools, so no need to massage VLANs or fi... 
[07:21:46] (03PS1) 10Marostegui: db-codfw.php: Repool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337777 (https://phabricator.wikimedia.org/T156478) [07:24:49] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337777 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [07:26:36] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337777 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [07:26:45] (03CR) 10jenkins-bot: db-codfw.php: Repool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337777 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [07:27:47] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2062 - T156478 (duration: 00m 42s) [07:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:53] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [07:33:43] !log Deploy alter table on x1 master (db1031) for the echo_notification tables - T136428 [07:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:47] T136428: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428 [07:45:42] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3028446 (10Gilles) Where can we test the RESTBase API version of hovercards? [07:49:52] (03CR) 10Muehlenhoff: "Good point, having the Diamond collectors in the ntd/timesyncd standard classes would be cleaner, I'll update the patch later on." 
[puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff) [07:58:43] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3028454 (10MoritzMuehlenhoff) @Robh: Yes, having his entry in data.yaml without an expiry date is just fine, all volunteers have that. The expiry date is only ne... [08:05:30] (03CR) 10Gilles: [C: 031] performance: switch xenon apache backend to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/337569 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [08:05:38] (03CR) 10Gilles: [C: 031] Switch xenon redis to mwlog1001.eqiad.wmnet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337568 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [08:21:14] !log installing PHP security updates on Ubuntu systems [08:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:33] 06Operations, 10DBA: Adapt wmf-mariadb10 package for jessie or puppetize differently its service to adapt it to systemd - https://phabricator.wikimedia.org/T116903#3028551 (10MoritzMuehlenhoff) My two cents: From a high level view I personally prefer the systemd unit to be in the Debian package since it's part... 
[08:42:33] (03CR) 10Filippo Giunchedi: "> https://gerrit.wikimedia.org/r/#/c/336420/ (not yet merged) adds" [puppet] - 10https://gerrit.wikimedia.org/r/337605 (https://phabricator.wikimedia.org/T140927) (owner: 10Filippo Giunchedi) [08:46:26] (03PS2) 10Yuvipanda: tools: Enable cronjobs for tools k8s [puppet] - 10https://gerrit.wikimedia.org/r/337776 (https://phabricator.wikimedia.org/T158155) [08:49:30] (03PS3) 10Filippo Giunchedi: Make the experimental archive section generally available [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [08:54:02] (03PS4) 10Filippo Giunchedi: Make the experimental archive section generally available [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [08:59:47] (03CR) 10Filippo Giunchedi: "Hashar, I've addressed your comments re: ensure and class paramenter!" [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [09:01:17] (03CR) 10Muehlenhoff: Make the experimental archive section generally available (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [09:01:20] (03PS1) 10Madhuvishy: labstore: Read directory size diamond collector config from external file [puppet] - 10https://gerrit.wikimedia.org/r/337785 [09:02:48] (03CR) 10jerkins-bot: [V: 04-1] labstore: Read directory size diamond collector config from external file [puppet] - 10https://gerrit.wikimedia.org/r/337785 (owner: 10Madhuvishy) [09:04:19] (03PS2) 10Madhuvishy: labstore: Read directory size diamond collector config from external file [puppet] - 10https://gerrit.wikimedia.org/r/337785 [09:05:37] (03PS3) 10Gehel: elasticsearch - reimage to jessie and move data to /srv - preliminary work [puppet] - 10https://gerrit.wikimedia.org/r/337378 (https://phabricator.wikimedia.org/T151326) [09:09:05] (03CR) 10Gehel: [C: 032] elasticsearch - reimage to jessie and move data to /srv - preliminary work [puppet] - 10https://gerrit.wikimedia.org/r/337378 
(https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [09:12:31] _joe_: I've got a puppet patch to bump the video scaler queue runner counts up, should fill the CPU better while retaining good behavior with the prio queue: https://gerrit.wikimedia.org/r/#/c/337230/ [09:13:01] let me know if that can be run whenever or needs to wait for another swat window :) [09:13:12] <_joe_> brion: thanks, i'll take a look and merge it if it makes sense [09:13:18] thanks! [09:13:42] the new queue's working great so far, just underutilizing due to the conservative runner count :) [09:13:55] <_joe_> we are limited in our manouvering space there, alas [09:14:16] yep [09:15:08] <_joe_> if we only had mw on kubernetes + autoscaling on CPU, how much time would we spare here. Well, maybe 1.5 years in the future :P [09:15:15] hehe [09:16:09] brion: good evening. There are also some overrides for mw1168 and mw1169 :) [09:16:17] hieradata/hosts/mw1168.yaml:mediawiki::jobrunner::runners_transcode: 4 [09:16:18] hieradata/hosts/mw1168.yaml:mediawiki::jobrunner::runners_transcode_prioritized: 12 [09:16:30] oh yeah i should possibly just take those out [09:16:34] lemme check the cpu counts [09:16:53] mw116[89] are the last hosts added, a bit more powerful [09:17:13] is that.... 16 cores + hyperthreading? nice [09:17:38] with HT that is like 22.67 CPU available ? :} [09:20:10] (03PS3) 10Brion VIBBER: Bump up number of queue runners for transcodes [puppet] - 10https://gerrit.wikimedia.org/r/337230 (https://phabricator.wikimedia.org/T108234) [09:20:51] ok, that should utilize those two machines better when the low-prio queue is full (as it is now) [09:22:51] <_joe_> brion: the issue I see is that overcommitting can cause issues [09:22:55] <_joe_> but I mean, we can try [09:23:02] *nod* [09:23:25] _joe_: want to split the difference? 6/6 and 10/10 instead of 8/8 and 12/12? 
[09:23:39] <_joe_> brion: I need to think about it a bit [09:23:44] ok [09:25:31] <_joe_> brion: actually, I'd bump up the low-priority jobs only. The 99th percentile of wait for the high-priority ones is 2 minutes [09:25:42] true [09:25:46] ok lemme tweak it [09:25:47] <_joe_> and has been below 20 minutes since the time it went to production [09:28:50] (03CR) 10Hashar: "apt::repository should always be applied, the ensure would be either present or absent." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [09:28:56] <_joe_> brion: thanks :) given the time of the day there, I can take care of it if you want [09:29:23] _joe_: i'm actually in london atm for a meeting :) [09:29:26] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3028596 (10Gehel) a:05Cmjohnson>03Gehel [09:29:32] <_joe_> ohhh I see [09:29:42] <_joe_> I was worried for your sleep cycle :) [09:30:27] (03PS1) 10Madhuvishy: tools: Read list of tools for precise email reminder from precise-tools dashboard [puppet] - 10https://gerrit.wikimedia.org/r/337787 [09:31:44] I have a question for general root/puppet assistance. I would like to be able to spawn multiple Jenkins in parallel and more or less isolate them from each others [09:31:45] (03CR) 10jerkins-bot: [V: 04-1] tools: Read list of tools for precise email reminder from precise-tools dashboard [puppet] - 10https://gerrit.wikimedia.org/r/337787 (owner: 10Madhuvishy) [09:31:56] (03PS4) 10Brion VIBBER: Bump up number of queue runners for transcodes [puppet] - 10https://gerrit.wikimedia.org/r/337230 (https://phabricator.wikimedia.org/T108234) [09:32:42] I am wrapping the java process with systemd. 
What I am struggling with is how to keep multiple instances separated; I thought of using different unix username and generate different systemd unit/service + different unix username + different paths [09:33:29] eg: jenkins-endtoend.service running as jenkins-endtoend user with /var/lib/jenkins-endtoend /var/log/jenkins-endtoend etc [09:33:52] (03PS2) 10Madhuvishy: tools: Read list of tools for precise email reminder from precise-tools dashboard [puppet] - 10https://gerrit.wikimedia.org/r/337787 [09:34:22] but that looks messy :/ [09:34:27] <_joe_> hashar: I think what you want is to define systemd instances [09:34:36] <_joe_> see what we did for carbon-relay [09:37:43] Description=carbon-cache (instance %i) [09:37:43] PartOf=carbon.service [09:37:48] _joe_: that looks promising :) [09:38:24] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 07User-notice: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3028611 (10Marostegui) After a chat with Jaime we have moved those old databases in labsdb1005 to: `labsdb1005:/srv/tmp/old_dbs` . They didn't have an... [09:38:36] (03CR) 10Muehlenhoff: "Yes, we want both suites to have the same priority. Otherwise you always have to fiddle with priorities/selections to explicity pull in th" [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [09:39:47] _joe_: then all instances are still running with the same user? [09:40:00] <_joe_> hashar: in theory, yes [09:40:07] <_joe_> hashar: what's the problem? [09:40:23] I don't know :} I am over thinking probably [09:40:25] <_joe_> hashar: better, what are you trying to do? 
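The systemd instance approach suggested above (one template unit, several named instances, as in the carbon-cache snippet with `%i` and `PartOf=`) can be sketched roughly as follows. This is only an illustration of how instance names and the `%i` specifier fit together. The template name `jenkins@.service` and the instance names are hypothetical and are not taken from the actual puppet repository.

```python
# Sketch only: "jenkins@.service" and the instance names below are
# hypothetical examples, not the real Wikimedia configuration.

TEMPLATE_NAME = "jenkins@.service"

# Modeled on the carbon-cache lines quoted in the log; %i is the systemd
# specifier that expands to the instance name.
TEMPLATE_BODY = """\
[Unit]
Description=Jenkins (instance %i)
PartOf=jenkins.service
"""


def instance_unit_name(template: str, instance: str) -> str:
    """Turn 'name@.service' plus an instance into 'name@instance.service'."""
    prefix, at, suffix = template.partition("@")
    if not at or not suffix.startswith("."):
        raise ValueError(f"not a template unit name: {template!r}")
    return f"{prefix}@{instance}{suffix}"


def expand_instance(body: str, instance: str) -> str:
    """Approximate what systemd does with %i for a simple instance name."""
    return body.replace("%i", instance)


if __name__ == "__main__":
    for name in ("endtoend", "beta"):  # hypothetical instances
        print(instance_unit_name(TEMPLATE_NAME, name))
        print(expand_instance(TEMPLATE_BODY, name))
```

With a template like this, every instance shares one unit definition and differs only by its instance name, which is why all instances run as the same user unless the unit says otherwise.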
[09:41:08] we want to investigate splitting the current huge Jenkins in multiple instances that are easier to manage / upgrade at any time [09:41:40] so the current contint1001 / contint2001 would eventually (we are not sure yet) end up with several jenkins running in parallel having different set of plugins and credentials [09:42:03] (03PS5) 10Filippo Giunchedi: Make the experimental archive section generally available [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [09:42:25] <_joe_> I'm pretty sure we can work something out; why the different users? [09:42:26] (03CR) 10Filippo Giunchedi: Make the experimental archive section generally available (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [09:42:31] maybe I should just do a first pass that uses different systemd instances and figure out later whether we want to isolate the master from each other (via different unix users / cgroups / firejail or whatever) [09:42:47] <_joe_> that would be my suggestion [09:42:52] different users because a couple decades ago that is how one would lamely isolate two process from each others [09:43:02] with mode 0640 on files [09:43:33] then eventually I have discovered/read the doc for fire jail yesterday and I think that is something we will want for Jenkins as well :} [09:43:44] 06Operations: Some mw hosts trigger a dpkg conffile prompt when upgrading php-pear - https://phabricator.wikimedia.org/T154007#3028621 (10MoritzMuehlenhoff) I found the culprit: https://phabricator.wikimedia.org/P4936 pear.conf was actually a red herring. The conffile prompt gets triggered by /etc/php5/cli/php... [09:44:41] hashar: thanks for the review btw! [09:44:46] hashar: jenkins is definitely a prime candidate, let me know if I can help with anything [09:45:12] but better migrate to the 2.32 LTS first, otherwise we'll need to adapt firejail changes needlessly [09:45:13] re: jenkins what each instance would contain? 
IOW under which "axis" you will split the instances? [09:45:14] moritzm: I need a brain chip for opsec 101 ? :} [09:45:45] I have yet to write the .plan for what I want to achieve. But on top of my brain the idea would be to have: [09:46:09] when the new LTS is up, we can have a look at jenkins tigether and pick the firejail containments that make sense for it [09:46:11] * couple jenkins master in active/active that will handle all the CI jobs that get triggered from Differential/Gerrit/Zuul. Eg the ones that ends up voting +1/-1 [09:46:29] * a jenkins solely for the beta cluster, to keep it updated, act as a central cron [09:46:59] * a jenkins to drive the end-to-end tests / browser tests which have some sensible credentials and need nice dashboard reporting [09:47:24] (03PS1) 10Elukey: Update the zookeeper module [puppet] - 10https://gerrit.wikimedia.org/r/337792 (https://phabricator.wikimedia.org/T157968) [09:47:46] * and maybe later a couple private Jenkins to release .deb / tarballs and another one to drive scap (that is an utopia) [09:48:54] (all of that would probably require to puppetize the Jenkins .xml config files which is going to be an interesting challenge :D) [09:50:47] (03PS1) 10Joal: Add new fields to archive_p view in labsdb [puppet] - 10https://gerrit.wikimedia.org/r/337793 (https://phabricator.wikimedia.org/T155658) [09:51:32] hashar: interesting, if there's a task for that let us know [09:51:58] (03CR) 10Elukey: [C: 032] "No op https://puppet-compiler.wmflabs.org/5464/" [puppet] - 10https://gerrit.wikimedia.org/r/337792 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [09:52:00] godog: randomly experimenting. Wanna do a POC in labs before :} [09:54:59] and my next question is: how one can acknowledge an alarm in Icinga. Is that restricted solely to ops or anyone being the contact of a service can act on it ? 
[09:55:46] if the later, my login is "hashar" but the Icinga contact is "amusso" :/ [09:56:17] something like the latter I think, non-ops can ack alerts but I don't know exactly how that works [09:57:47] from the doc ( https://docs.icinga.com/latest/en/cgiauth.html ) it says: An authenticated contact is an authenticated user whose username matches the short name of a contact definition. [09:57:52] so I guess yeah name mismatch :} [10:00:16] 06Operations: Some mw hosts trigger a dpkg conffile prompt when upgrading php-pear - https://phabricator.wikimedia.org/T154007#3028652 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff [10:01:40] 06Operations, 10Ops-Access-Requests, 10Icinga, 10Monitoring, 06Release-Engineering-Team: Rename Icinga contact 'amusso' to 'hashar' - https://phabricator.wikimedia.org/T158167#3028702 (10hashar) [10:02:02] godog: ^ if you wanna mess up with Icinga contact list. I guess it is all about renaming my contact from 'amusso' to 'hashar' [10:02:16] to match the ldap account name I connect with [10:04:41] (03PS1) 10Elukey: Set maximum JVM heap size for Zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) [10:04:44] (03CR) 10Jcrespo: "Those fields are not in use on production (I blocked that), and they will be done properly (deleted) later in the year: https://www.mediaw" [puppet] - 10https://gerrit.wikimedia.org/r/337793 (https://phabricator.wikimedia.org/T155658) (owner: 10Joal) [10:07:22] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#2954751 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1048.eqiad.wmnet'] ``` The... 
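The name mismatch discussed above follows from the Icinga rule quoted in the log: a user counts as an authenticated contact only when the login name matches the short name of a contact definition. Here is a minimal sketch of that rule, assuming a plain exact string comparison (the exact matching behaviour is an assumption). The contact set is hypothetical apart from the two names mentioned in the log.

```python
# Sketch only: exact string matching is an assumption about how the
# comparison is done; "amusso" and "hashar" are the names from the log.

def is_authenticated_contact(login: str, contact_short_names: set) -> bool:
    """True when the login name equals the short name of some contact."""
    return login in contact_short_names


if __name__ == "__main__":
    contacts = {"amusso"}  # contact definition as described in the log
    print(is_authenticated_contact("hashar", contacts))  # login used in the log
    print(is_authenticated_contact("amusso", contacts))
```

Under this rule, renaming the contact to match the login (as requested in T158167) would make the first check succeed.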
[10:09:25] (03PS3) 10Filippo Giunchedi: performance: switch xenon apache backend to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/337569 (https://phabricator.wikimedia.org/T123728) [10:11:43] (03PS4) 10Filippo Giunchedi: performance: switch xenon apache backend to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/337569 (https://phabricator.wikimedia.org/T123728) [10:11:45] (03PS1) 10Filippo Giunchedi: udp2log: fix mirroring of received packets [puppet] - 10https://gerrit.wikimedia.org/r/337798 (https://phabricator.wikimedia.org/T123728) [10:12:11] hashar: sorry I can ATM, trying to do too many things already :( [10:13:01] can't, even [10:13:01] (03PS1) 10Gehel: elasticsearch: adding new servers elastic1048-1052 [puppet] - 10https://gerrit.wikimedia.org/r/337800 (https://phabricator.wikimedia.org/T155790) [10:13:06] (03PS2) 10Filippo Giunchedi: deployment::server: enable jessie-wikimedia/experimental [puppet] - 10https://gerrit.wikimedia.org/r/337605 (https://phabricator.wikimedia.org/T140927) [10:15:09] (03CR) 10Gehel: [C: 032] elasticsearch: adding new servers elastic1048-1052 [puppet] - 10https://gerrit.wikimedia.org/r/337800 (https://phabricator.wikimedia.org/T155790) (owner: 10Gehel) [10:18:45] (03PS2) 10Filippo Giunchedi: udp2log: fix mirroring of received packets [puppet] - 10https://gerrit.wikimedia.org/r/337798 (https://phabricator.wikimedia.org/T123728) [10:18:47] (03PS5) 10Filippo Giunchedi: performance: switch xenon apache backend to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/337569 (https://phabricator.wikimedia.org/T123728) [10:21:21] (03PS6) 10Filippo Giunchedi: Make the experimental archive section generally available [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [10:21:46] (03CR) 10Filippo Giunchedi: Make the experimental archive section generally available (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [10:22:37] (03CR) 10Muehlenhoff: [C: 031] "Looks good 
to me!" [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [10:23:54] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:26:04] PROBLEM - DPKG on mw2232 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:31:45] seems related to php-pear, Unable to lock the administration directory (/var/lib/dpkg/), is another process using it? [10:32:04] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 37 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [10:32:44] mmm a dpkg is holding /var/lib/dpkg/lock [10:33:22] (03PS3) 10Filippo Giunchedi: udp2log: fix mirroring of received packets [puppet] - 10https://gerrit.wikimedia.org/r/337798 (https://phabricator.wikimedia.org/T123728) [10:33:35] from last nobody seems working on it, will try to fix it [10:33:52] (maybe it is moritzm via salt?) [10:35:22] ah yes parent is salt-minion [10:35:31] so probably a race with puppet [10:36:03] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] udp2log: fix mirroring of received packets [puppet] - 10https://gerrit.wikimedia.org/r/337798 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [10:36:25] (03PS1) 10Jcrespo: mariadb: Depool db1045 and move roles around [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337802 (https://phabricator.wikimedia.org/T147747) [10:38:10] (03PS2) 10Jcrespo: mariadb: Depool db1045 and move roles around [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337802 (https://phabricator.wikimedia.org/T147747) [10:38:55] (03PS1) 10Gehel: jessie installs: adding rootdelay=90 to kernel options [puppet] - 10https://gerrit.wikimedia.org/r/337804 [10:42:17] (03CR) 10Gehel: "I'm not really sure why we have a rootdelay option for precise and trusty, but not for jessie. There might be a good reason. 
After reimagi" [puppet] - 10https://gerrit.wikimedia.org/r/337804 (owner: 10Gehel) [10:43:17] elukey: having a look [10:43:39] gehel: I think you've run into T149845 if you want to add the bug to the code review [10:43:40] T149845: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845 [10:44:04] RECOVERY - DPKG on mw2232 is OK: All packages OK [10:44:09] elukey: that was also affected by https://phabricator.wikimedia.org/T154007, fixed [10:44:36] godog: yes, it does look similar! (to be honest, I don't understand much here...) [10:44:57] (03PS2) 10Gehel: jessie installs: adding rootdelay=90 to kernel options [puppet] - 10https://gerrit.wikimedia.org/r/337804 (https://phabricator.wikimedia.org/T149845) [10:46:11] gehel: heh I've run into that too but couldn't think of anything obvious why it would behave that way, especially not on all hosts so looks like some kind of race [10:46:52] !log installing PHP security updates on silver (running wikitech) [10:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:21] godog: I've been told that editing the grub command line and adding a root delay would solve the issue (and I can confirm anecdotal evidence that it does). [10:47:50] and we seem to have that option in the trusty config. There is probably a good reason to not have it in Jessie, but I have no idea why... [10:49:16] 06Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#2766226 (10Gehel) I ran into the same issue when migrating elasticsearch servers to jessie. I was told to manually add "rootdelay" to the grub command line. It looks like we do have a rootdelay configured... 
[10:49:34] !log installing PHP security updates on uranium (running ganglia) [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:44] (03CR) 10Ema: Analytics VCL: default to 'org' if top_domain is not set (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337549 (https://phabricator.wikimedia.org/T138027) (owner: 10Ema) [10:52:54] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [10:59:03] !log installing PHP security updates on californium (running horizon) [10:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:54] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [11:05:11] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3028812 (10elukey) Proposed fixes: ``` delete firewall family inet filter analytics-in4 term udplog delete firewall family inet filter analytics-in4 term prelabsdb-mysql delete firewall... [11:08:24] (03PS7) 10Filippo Giunchedi: Make the experimental archive section generally available [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [11:10:11] (03CR) 10Filippo Giunchedi: [C: 032] Make the experimental archive section generally available [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [11:11:24] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:13:37] (03PS1) 10Ema: varnish: icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337808 (https://phabricator.wikimedia.org/T145661) [11:14:03] (03PS3) 10Filippo Giunchedi: deployment::server: enable jessie-wikimedia/experimental [puppet] - 10https://gerrit.wikimedia.org/r/337605 (https://phabricator.wikimedia.org/T140927) [11:14:55] (03PS1) 10Hashar: contint: /var/lib/jenkins/builds is no more [puppet] - 10https://gerrit.wikimedia.org/r/337809 [11:15:14] (03CR) 10Filippo Giunchedi: [C: 032] deployment::server: enable jessie-wikimedia/experimental [puppet] - 10https://gerrit.wikimedia.org/r/337605 (https://phabricator.wikimedia.org/T140927) (owner: 10Filippo Giunchedi) [11:17:28] (03PS1) 10Gehel: elasticsearch: add rack information for new servers elastic1048-1052 [puppet] - 10https://gerrit.wikimedia.org/r/337810 (https://phabricator.wikimedia.org/T155790) [11:18:50] (03PS2) 10Gehel: elasticsearch: add rack information for new servers elastic1048-1052 [puppet] - 10https://gerrit.wikimedia.org/r/337810 (https://phabricator.wikimedia.org/T155790) [11:20:27] !log upgrade git on tin/mira - T140927 [11:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:31] T140927: Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927 [11:24:25] (03Abandoned) 10Hashar: (DO NOT SUBMIT) octopus merge of Jenkins changes [puppet] - 10https://gerrit.wikimedia.org/r/337399 (owner: 10Hashar) [11:24:35] (03PS2) 10Ema: varnish: icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337808 (https://phabricator.wikimedia.org/T145661) [11:25:28] (03CR) 10jerkins-bot: [V: 04-1] varnish: icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337808 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [11:26:38] (03PS2) 10Hashar: contint: /var/lib/jenkins/builds is no more [puppet] - 
10https://gerrit.wikimedia.org/r/337809 [11:26:40] (03PS3) 10Hashar: jenkins: logrotate all log files [puppet] - 10https://gerrit.wikimedia.org/r/337383 [11:26:42] (03PS3) 10Hashar: jenkins: merge user/group sub classes [puppet] - 10https://gerrit.wikimedia.org/r/337287 [11:26:44] (03PS3) 10Hashar: jenkins: sync default file with upstream 1.651.3 [puppet] - 10https://gerrit.wikimedia.org/r/337289 [11:26:46] (03PS5) 10Hashar: jenkins: support variable prefix setting [puppet] - 10https://gerrit.wikimedia.org/r/337307 [11:26:48] (03PS3) 10Hashar: jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377 [11:26:50] (03PS3) 10Hashar: jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/337385 [11:26:52] (03PS4) 10Hashar: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 [11:26:54] (03PS4) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [11:27:08] (03CR) 10Gehel: [C: 032] elasticsearch: add rack information for new servers elastic1048-1052 [puppet] - 10https://gerrit.wikimedia.org/r/337810 (https://phabricator.wikimedia.org/T155790) (owner: 10Gehel) [11:27:10] (03PS3) 10Ema: varnish: icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337808 (https://phabricator.wikimedia.org/T145661) [11:35:03] (03PS1) 10Jcrespo: Upgrade mariadb module to new template hierarchy (unused) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/337813 [11:35:46] !log bblack@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1067.eqiad.wmnet [11:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:24] (03PS1) 10Jcrespo: mariadb: Move grants and mysqld config files to the role [puppet] - 10https://gerrit.wikimedia.org/r/337814 [11:38:07] !log Running pt-table-checksum on db1043 (m3 - phabricator master) - T154485 [11:38:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:12] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485 [11:38:26] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move grants and mysqld config files to the role [puppet] - 10https://gerrit.wikimedia.org/r/337814 (owner: 10Jcrespo) [11:38:34] (03PS3) 10Muehlenhoff: Only add the Diamond collector if ISC dhcpd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) [11:38:36] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3028896 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1048.eqiad.wmnet'] ``` and were **ALL** successful. [11:40:24] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [11:40:57] (03CR) 10Jcrespo: [C: 032] Upgrade mariadb module to new template hierarchy (unused) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/337813 (owner: 10Jcrespo) [11:42:06] (03PS2) 10Jcrespo: mariadb: Move grants and mysqld config files to the role [puppet] - 10https://gerrit.wikimedia.org/r/337814 [11:47:42] (03PS4) 10Ema: varnish: icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337808 (https://phabricator.wikimedia.org/T145661) [11:49:15] (03PS1) 10Gehel: elasticsearch: add new servers to cluster and to LVS (elastic1048-1052) [puppet] - 10https://gerrit.wikimedia.org/r/337816 (https://phabricator.wikimedia.org/T155790) [11:49:25] (03PS3) 10Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 [11:50:35] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - 
https://phabricator.wikimedia.org/T156922#3028943 (10Joe) >>! In T156922#3012431, @fgiunchedi wrote: >>>!... [11:54:43] (03PS1) 10Jcrespo: phabricator database: Move templates to the role [puppet] - 10https://gerrit.wikimedia.org/r/337827 [11:59:36] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/5474/" [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff) [12:01:00] (03PS1) 10Jcrespo: HAProxy: move templates under the role [puppet] - 10https://gerrit.wikimedia.org/r/337834 [12:04:13] (03PS3) 10Jcrespo: mariadb: Move grants and mysqld config files to the role [puppet] - 10https://gerrit.wikimedia.org/r/337814 [12:04:15] (03PS2) 10Jcrespo: phabricator database: Move templates to the role [puppet] - 10https://gerrit.wikimedia.org/r/337827 [12:04:17] (03PS2) 10Jcrespo: HAProxy: move templates under the role [puppet] - 10https://gerrit.wikimedia.org/r/337834 [12:04:19] (03PS1) 10Jcrespo: mariadb-sanitarium: move custom init.d under the mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/337835 [12:06:20] (03PS3) 10Hashar: contint: /var/lib/jenkins/builds is no more [puppet] - 10https://gerrit.wikimedia.org/r/337809 [12:06:22] (03PS4) 10Hashar: jenkins: logrotate all log files [puppet] - 10https://gerrit.wikimedia.org/r/337383 [12:06:24] (03PS4) 10Hashar: jenkins: merge user/group sub classes [puppet] - 10https://gerrit.wikimedia.org/r/337287 [12:06:26] (03PS4) 10Hashar: jenkins: sync default file with upstream 1.651.3 [puppet] - 10https://gerrit.wikimedia.org/r/337289 [12:06:28] (03PS6) 10Hashar: jenkins: support variable prefix setting [puppet] - 10https://gerrit.wikimedia.org/r/337307 [12:06:30] (03PS4) 10Hashar: jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377 [12:06:32] (03PS4) 10Hashar: jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/337385 [12:06:34] (03PS5) 10Hashar: jenkins: allow changing the web 
service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 [12:06:36] (03PS5) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [12:06:38] (03PS1) 10Hashar: jenkins: add basic specs [puppet] - 10https://gerrit.wikimedia.org/r/337836 [12:06:40] (03PS1) 10Jcrespo: Remove the templates dir, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/337837 [12:12:17] (03CR) 10Hashar: [C: 04-1] jenkins: migrate to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [12:16:24] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337839 (https://phabricator.wikimedia.org/T158171) [12:16:55] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1045 and move roles around [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337802 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [12:18:48] (03Merged) 10jenkins-bot: mariadb: Depool db1045 and move roles around [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337802 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [12:18:57] (03CR) 10jenkins-bot: mariadb: Depool db1045 and move roles around [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337802 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [12:29:47] (03CR) 10jerkins-bot: [V: 04-1] jenkins: add basic specs [puppet] - 10https://gerrit.wikimedia.org/r/337836 (owner: 10Hashar) [12:29:53] (03PS1) 10Jcrespo: Increse the concurrent threads of large mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) [12:31:09] (03PS2) 10Jcrespo: Increase the concurrent threads of large mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) [12:33:53] (03Abandoned) 10Joal: Add new fields to archive_p view in labsdb [puppet] - 10https://gerrit.wikimedia.org/r/337793 (https://phabricator.wikimedia.org/T155658) 
(owner: 10Joal) [12:33:54] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1045 (duration: 00m 42s) [12:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:26] 06Operations, 07HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3028990 (10MoritzMuehlenhoff) [12:41:58] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337839 (https://phabricator.wikimedia.org/T158171) (owner: 10Urbanecm) [12:52:27] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:56:53] !log restart of jmxtrans on all the analytics kafka brokers [12:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:03] (03PS1) 10Hashar: Revert "ldap: Add warning to ldaplist" [puppet] - 10https://gerrit.wikimedia.org/r/337842 (https://phabricator.wikimedia.org/T114063) [13:07:13] (03PS2) 10Hashar: Revert "ldap: Add warning to ldaplist" [puppet] - 10https://gerrit.wikimedia.org/r/337842 (https://phabricator.wikimedia.org/T114063) [13:08:56] (03CR) 10Hashar: "I understand the need to remove ldapsupportlib.py but at the same time we are still using ldaplist until a replacement is found/written. 
" [puppet] - 10https://gerrit.wikimedia.org/r/337842 (https://phabricator.wikimedia.org/T114063) (owner: 10Hashar) [13:10:14] (03PS2) 10Dereckson: Add throttle rule for Royal College of Nursing event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337839 (https://phabricator.wikimedia.org/T158171) (owner: 10Urbanecm) [13:10:30] * Dereckson is going to deploy this (as the event already started) ^ [13:11:33] (03CR) 10Dereckson: [C: 032] "Urgent deployment (event already started)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337839 (https://phabricator.wikimedia.org/T158171) (owner: 10Urbanecm) [13:12:48] 06Operations, 10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#3029048 (10mark) Approved. [13:12:57] (03Merged) 10jenkins-bot: Add throttle rule for Royal College of Nursing event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337839 (https://phabricator.wikimedia.org/T158171) (owner: 10Urbanecm) [13:13:08] (03CR) 10jenkins-bot: Add throttle rule for Royal College of Nursing event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337839 (https://phabricator.wikimedia.org/T158171) (owner: 10Urbanecm) [13:13:12] 06Operations, 10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#3029052 (10mark) a:05mark>03RobH [13:16:40] !log dereckson@tin Synchronized wmf-config/throttle.php: Throttle rule for Royal College of Nursing event (T158171) (duration: 00m 43s) [13:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:44] T158171: Lift registration cap from an IP for event on 15 Feb - https://phabricator.wikimedia.org/T158171 [13:20:27] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:23:32] (03CR) 10Hashar: "About the static UID/GID Daniel Zahn 
would know. A few gotchas:" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [13:26:55] (03CR) 10Marostegui: [C: 031] Increase the concurrent threads of large mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [13:27:10] jouncebot: next [13:27:10] In 0 hour(s) and 32 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170215T1400) [13:28:50] dcausse: good afternoon. I CR+2 your CirrusSearch patch in preparation of the swat :} [13:28:59] so we would not have to wait for the tests to complete [13:31:18] hashar: thanks! [13:34:00] (03CR) 10Marostegui: [C: 031] "This looks good: https://puppet-compiler.wmflabs.org/5476/" [puppet] - 10https://gerrit.wikimedia.org/r/337814 (owner: 10Jcrespo) [13:37:28] (03CR) 10DCausse: [C: 031] elasticsearch: add new servers to cluster and to LVS (elastic1048-1052) [puppet] - 10https://gerrit.wikimedia.org/r/337816 (https://phabricator.wikimedia.org/T155790) (owner: 10Gehel) [13:41:42] (03CR) 10Ottomata: [C: 031] Set maximum JVM heap size for Zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [13:44:21] (03PS2) 10Elukey: Set maximum JVM heap size for Zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) [13:46:39] (03CR) 10Gehel: [C: 032] elasticsearch: add new servers to cluster and to LVS (elastic1048-1052) [puppet] - 10https://gerrit.wikimedia.org/r/337816 (https://phabricator.wikimedia.org/T155790) (owner: 10Gehel) [13:51:28] (03CR) 10Giuseppe Lavagetto: "I like it a lot in general, most of my comments are either things I wasn't so sure about or smaller details. 
The overall architecture seem" (0341 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [13:53:33] elukey: as you can see joe and me are even now :-P [13:53:40] ahahahah [13:53:47] I was about to write something about it :P [13:57:11] :D [13:57:35] (03CR) 10Jcrespo: "Hashar- we should definitely continue this conversation on a separate ticket, a parent ticket of T100501." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [13:59:27] !log disabled mod_deflate on bohrium (piwik) and disabled puppet. Testing 503 reduction. [13:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170215T1400). Please do the needful. [14:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:17] o/ [14:00:19] dcausse: lets unleash it. Wanna test it on mwdebug1001 first ? [14:00:47] hashar:I can't it's only maintenance code [14:00:53] ;) [14:01:06] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3029214 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1049.eqiad.wmnet'] ``` The... [14:01:12] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3029215 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1050.eqiad.wmnet'] ``` The... 
[14:01:13] hroaiea [14:01:17] the patch hasn't landed [14:01:31] :/ [14:01:39] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3029216 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1051.eqiad.wmnet'] ``` The... [14:01:40] (03PS3) 10Elukey: Set maximum JVM heap size for Zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) [14:01:45] 0/ [14:01:54] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3029217 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1052.eqiad.wmnet'] ``` The... [14:02:14] hashar, dcausse: the two of you are in charge of swat today? [14:02:24] Time: 12.2 minutes, Memory: 1706GB [14:02:34] oO [14:02:40] just in time [14:03:52] dcausse: sync in progress [14:03:58] ok [14:04:03] hopefully the VERSION bump is not going to cause any havoc [14:04:34] !log hashar@tin Synchronized php-1.29.0-wmf.12/extensions/CirrusSearch/includes/Maintenance/SuggesterAnalysisConfigBuilder.php: Fold some problematic whitespaces with completion - T156234 (duration: 00m 48s) [14:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:42] T156234: ICU folding seems to cause issues with completion - https://phabricator.wikimedia.org/T156234 [14:05:24] hashar: thanks! 
will test a small script on testwiki to be sure, I'll swat fix this evening if I run into troubles [14:05:35] marostegui: jynus : we have a bunch of hhvm spam log such as : SlowTimer [59996ms] at runtime/ext_mysql: slow query: SELECT MASTER_GTID_WAIT('0-171970704-5549940242', 60) [14:05:46] I guess something is waiting for replication [14:05:47] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:06:16] hashar: let me see [14:06:21] PROBLEM - MariaDB Slave Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 567.71 seconds [14:06:46] well, there you have it [14:06:48] ^ that could be it [14:06:49] yep [14:06:49] XD [14:06:50] marostegui: spotted them on fluorine in the hhvm.log or fatalmonitor [14:07:14] I think it crashed [14:07:17] uptime is 10 m [14:07:23] dcausse: how your backport is for .12 and everything is still on wmf.11 [14:07:23] yeah and repl is stopped [14:07:31] dcausse: s/how/ohhh/ [14:07:33] do not start yet [14:07:39] nope [14:07:41] I will depool it [14:07:59] the machine didn't crash [14:08:08] the poor Icinga slave_sql_lag check is a bit laggy :) [14:08:09] hashar: oh... wmf12 was cancelled yesterday? [14:08:10] no, only mysql [14:08:20] I will depool now [14:08:26] I am doing it [14:08:30] ok [14:08:30] dcausse: apparently. According to wikiversion.json and http://tools.wmflabs.org/versions/ [14:08:38] I will keep an eye on the other server [14:08:43] in case this is a mediawiki issue [14:08:46] dcausse: if you backport it to wmf.11 I don't mind deploying it :) [14:09:03] dcausse: hopefully it is just about pressing cherry-pick in Gerrit, CR+2 and then scap sync-file :) [14:09:29] hashar: ok if you don't mind I'd be happy to deploy it on wmf11 :) [14:09:45] cherry-picking [14:10:02] [ERROR] InnoDB: Tried to read 16384 bytes at offset 28488318976 [14:10:02] . Was only able to read 0. 
[14:10:08] (03PS1) 10Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337846 [14:10:09] 2017-02-15 13:56:41 7fdf53d9d700 InnoDB: Operating system error number 5 in a [14:10:09] file operation. [14:10:12] jynus: ^ [14:10:15] InnoDB: Error number 5 means 'Input/output error'. [14:10:24] [11085714.884722] blk_update_request: I/O error, dev sda, sector 4627225248 [14:10:39] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337846 (owner: 10Marostegui) [14:10:39] looks like storage crashed [14:10:46] dcausse: we can do all the cherry pick / merge dance. I will deploy once the databases side is clear [14:10:50] raid controller again? [14:11:20] hashar: sure, https://gerrit.wikimedia.org/r/#/c/337845/ [14:11:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337846 (owner: 10Marostegui) [14:11:51] (03PS4) 10Elukey: Set maximum JVM heap size for Zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) [14:11:55] (03PS10) 10Giuseppe Lavagetto: Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) [14:13:38] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337846 (owner: 10Marostegui) [14:13:50] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337846 (owner: 10Marostegui) [14:14:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 (duration: 00m 44s) [14:14:45] hashar: I am done [14:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:23] ACKNOWLEDGEMENT - MariaDB Slave Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1107.86 seconds Marostegui looks like storage crashed, we are investigating 
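(Editor's sketch, not part of the channel traffic.) The kernel line pasted above, "blk_update_request: I/O error, dev sda, sector 4627225248", is what distinguished a storage failure from a MySQL problem here. A minimal Python sketch of pulling the device and sector out of such lines might look like the following. The function name, the tuple type, and the regular expression are illustrative assumptions based only on the single line format quoted in this log, not an existing tool.

```python
import re
from typing import NamedTuple, Optional


class BlockIOError(NamedTuple):
    """Device and sector reported by a kernel block-layer I/O error line."""
    device: str
    sector: int


# Assumption: the pattern is modelled only on the one line quoted in the log.
_BLK_ERROR = re.compile(
    r"blk_update_request: I/O error, dev ([^,\s]+), sector (\d+)"
)


def parse_block_io_error(line: str) -> Optional[BlockIOError]:
    """Return the device and sector if the line is a block I/O error, else None."""
    match = _BLK_ERROR.search(line)
    if match is None:
        return None
    return BlockIOError(device=match.group(1), sector=int(match.group(2)))


sample = "[11085714.884722] blk_update_request: I/O error, dev sda, sector 4627225248"
print(parse_block_io_error(sample))
```

Lines that do not match, such as the InnoDB messages, simply return None, so the same function could be run over a whole dmesg dump to list only the block-level errors.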
[14:15:30] /msg marostegui An_ill_database whispers you "thank you!" [14:15:42] I do not see logs on the console [14:15:48] I will check the kernel [14:15:49] hashar: XDD [14:16:16] jynus: yep, the dmesg is quite clear [14:16:30] but we should also have some logs on the iDRAC/iLO :( [14:16:37] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:16:54] [Wed Feb 15 14:00:19 2017] sd 0:1:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [14:17:08] it is a block device failure, that is for sure [14:17:13] not a mysql error [14:17:15] yeah [14:17:21] the disks are showing all fine [14:17:24] blk_update_request: I/O error, dev sda, sector 4627225248 [14:17:25] (which can be a lie) [14:20:11] it was definitely the storage [14:22:47] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:23:43] no logs on the iLO [14:24:08] no :( [14:24:11] I would file a ticket with all information, stop the server [14:24:24] restart it, confirm it is ok, pool it back [14:24:30] i will take care of that [14:24:57] thank you, then [14:25:11] ack and disable alerts while at it, so it does not page anymore [14:25:18] i did the ack already [14:25:21] will disable alerts [14:25:41] I mean everywhere for reboot [14:25:50] yes yes :) [14:26:20] there is a slight chance some block is faulty, but that would be really strange on RAID10 [14:26:49] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3029251 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1052.eqiad.wmnet'] ``` and were **ALL** successful. [14:26:59] dcausse: CI is a bit busy. 
I will poke you when I am about to deploy the CirrusSearch patch for wmf.11 [14:27:16] hashar: sure [14:28:01] PROBLEM - MariaDB Slave Lag: s2 on db1060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 424.22 seconds [14:28:17] another one…let's check [14:28:47] this is s2 [14:28:58] lag is coming down [14:29:37] what is happening, did we deploy code yesterday? [14:30:30] looks so as per deployments page [14:31:29] based on https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1060&from=1487147462910&to=1487169062910 [14:31:36] it may have a faulty disk [14:31:49] or any other issue creating lag [14:32:10] maybe now it is the time load increases and latent problems show up [14:32:40] I will take a look at db1060, continue with db1080 [14:32:45] ok [14:32:55] 82 [14:32:56] I mean [14:32:58] RECOVERY - MariaDB Slave Lag: s2 on db1060 is OK: OK slave_sql_lag Replication lag: 39.90 seconds [14:33:00] 06Operations, 10DBA: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3029269 (10Marostegui) [14:33:53] In both cases, no users should have been affected, automatic depool seemed to work nicely [14:34:46] !log Stop MySQL and shutdown db1082 - T158188 [14:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:50] T158188: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188 [14:35:51] on db1060 I see a couple of disks with media errors [14:35:59] but no smart alerts [14:36:53] PROBLEM - Elasticsearch HTTPS on elastic1052 is CRITICAL: SSL CRITICAL - failed to verify search.svc.eqiad.wmnet against elastic1052.eqiad.wmnet [14:37:08] (03PS1) 10Muehlenhoff: Remove gehel from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/337850 [14:37:33] I do not see anomalous queries [14:37:36] I am watching db1082 reboot via iLO and so far no complaints about anything hardware related [14:37:38] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/337850 (owner: 10Muehlenhoff) [14:38:10] but I see contention caused by binlog [14:38:52] IO write time is anomalously high [14:40:19] (03PS1) 10Jgreen: rename fundraising db1008 to frav1001 and change its IP [dns] - 10https://gerrit.wikimedia.org/r/337851 [14:40:46] I think I am going to offline 1 or 2 disks [14:41:39] ok [14:42:22] Adapter: 0: EnclId-32 SlotId-4 state changed to OffLine. [14:42:26] I will wait [14:43:13] and if it doesn't improve, do 32:7 [14:44:33] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [14:44:54] (03CR) 10Filippo Giunchedi: [C: 04-1] "+1 on the general concept, some generic consistency checks can be moved to module/admin whereas wmf-specific and ldap checks can stay here" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [14:45:14] Adapter: 0: EnclId-32 SlotId-7 state changed to OffLine. [14:45:29] !log offlined 2 disks with media + other errors on db1060 [14:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:16] now it is getting worse... [14:46:46] the lag is coming back? [14:47:09] it is not statistically significant yet [14:47:14] it is now going down [14:47:18] [6488675.728739] scanning ... [14:47:20] but we have to wait [14:47:24] might be because of the RAID rescan [14:49:09] I do not think it worked [14:49:22] 06Operations, 10DBA: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3029289 (10Marostegui) Server rebooted fine it showed this on dmesg which I am not completely aware of what it means : ``` [ 32.823256] hpsa 0000:08:00.0: Acknowledging event: 0xc0000000 (HP SSD Smart Path configuration ch... 
[14:49:27] (03PS2) 10Jgreen: rename fundraising db1008 to frav1001 and change its IP [dns] - 10https://gerrit.wikimedia.org/r/337851 [14:50:33] jynus: I think I know what might be happening [14:50:46] 82 or 60? [14:50:49] 60 [14:50:57] please tell! [14:51:32] it is going to page again [14:51:34] (03CR) 10Cmjohnson: [C: 031] rename fundraising db1008 to frav1001 and change its IP [dns] - 10https://gerrit.wikimedia.org/r/337851 (owner: 10Jgreen) [14:51:43] root@db1060:~# megacli -LDInfo -L0 -a0 | grep "Current Cache Policy:" [14:51:46] Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU [14:51:51] BBU looks gone and it went to writethrough [14:52:11] I can force it to go to writeback and that should help but if we have a power issue we might lose data [14:52:13] yeah I don't think we have a check for that :/ [14:52:17] ah [14:52:21] did you check whether the battery is training? [14:52:24] forgot the simplest explanation [14:52:27] it is dead I believe [14:52:28] look [14:52:36] paravoid, I disabled that for all dbs [14:52:40] that is often a source of problems, megaraid frequently retrains the battery [14:52:43] oh [14:52:45] nevermind then :) [14:53:34] https://phabricator.wikimedia.org/P4937 [14:54:18] jynus: let's force WriteBack to see if that helps [14:54:26] the alert used to show that [14:54:39] maybe for another vendor, and I am mixing things up [14:54:47] (03PS6) 10Filippo Giunchedi: performance: switch xenon apache backend to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/337569 (https://phabricator.wikimedia.org/T123728) [14:54:49] (03PS1) 10Filippo Giunchedi: xenon: add apache 2.4 conditional for access control [puppet] - 10https://gerrit.wikimedia.org/r/337855 (https://phabricator.wikimedia.org/T123728) [14:54:58] the HP checks include BBU checks [14:55:17] I don't think we have anything checking BBUs or battery training etc. 
for megaraid [14:55:18] RECOVERY - MariaDB Slave Lag: s5 on db1082 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:55:26] these are all from memory, both could be incorrect [14:55:47] I have forced WriteBack [14:55:53] and the lag is gone [14:56:02] <_joe_> dell megaraid definitely exposed the battery status for the raid [14:56:09] <_joe_> to linux [14:56:17] I will open a ticket to get the BBU changed [14:56:19] on the alert [14:56:57] Auto-Learn Mode: Disabled [14:57:11] Relative State of Charge: 18 % [14:57:18] (03PS2) 10Filippo Giunchedi: xenon: add apache 2.4 conditional for access control [puppet] - 10https://gerrit.wikimedia.org/r/337855 (https://phabricator.wikimedia.org/T123728) [14:57:23] Learn Cycle Requested : Yes ? [14:57:46] Battery State: Degraded [14:58:39] yes, it was that [14:58:59] but now I am not sure I want to reenable disks with media errors [14:59:17] I would leave them out for now [15:00:03] PROBLEM - MegaRAID on db1060 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [15:00:04] ACKNOWLEDGEMENT - MegaRAID on db1060 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T158193 [15:00:08] 06Operations, 10ops-eqiad: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3029364 (10ops-monitoring-bot) [15:00:31] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] "CI backed up" [puppet] - 10https://gerrit.wikimedia.org/r/337855 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [15:00:40] 06Operations, 10ops-eqiad, 10DBA: Replaced BBU for db1060 - https://phabricator.wikimedia.org/T158194#3029382 (10Marostegui) p:05Triage>03High [15:00:52] jynus task created ^ [15:01:04] thanks [15:01:31] it says a manual learn is required [15:01:39] should we force one while depooled? [15:02:07] I tried the manual learn myself to see if it would complain about the BBU failing [15:03:17] * Jeff_Green is confused by gerrit again... 
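The exchange above notes that nothing currently checks for the controller falling back to write-through when the BBU fails. A minimal sketch of such a check follows, assuming only the output line format pasted at 14:51:46. The function names `parse_cache_policy` and `is_write_through` are hypothetical, the `WriteBack` sample line is an assumed variant of the pasted output, and this is not an existing Icinga plugin.

```python
# Sketch only: parses one "Current Cache Policy:" line as pasted in the log.
# Function names are hypothetical; no megacli invocation is attempted here.

PREFIX = "Current Cache Policy:"


def parse_cache_policy(line):
    """Return the comma-separated policy fields from a cache policy line."""
    text = line.strip()
    if not text.startswith(PREFIX):
        raise ValueError("not a cache policy line: %r" % line)
    # Drop the prefix, then split the remainder on commas.
    return [field.strip() for field in text[len(PREFIX):].split(",")]


def is_write_through(line):
    """True when the logical drive currently runs in WriteThrough mode."""
    return "WriteThrough" in parse_cache_policy(line)


if __name__ == "__main__":
    # Line copied from the 14:51:46 message in this log.
    sample = ("Current Cache Policy: WriteThrough, ReadAheadNone, Direct, "
              "No Write Cache if Bad BBU")
    if is_write_through(sample):
        print("CRITICAL: logical drive is in WriteThrough mode")
    else:
        print("OK: logical drive is not in WriteThrough mode")
```

A real check would feed each matching line of the `megacli -LDInfo` output through `is_write_through` and alert on any drive that reports WriteThrough.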
[15:03:19] yeah, we can do that once we depool it to get it replaced [15:03:57] can anyone tell me why https://gerrit.wikimedia.org/r/#/c/337851/ does not seem to have the usual button to merge? [15:04:54] Jeff_Green, probably jenkins (I think it dns has CI) failed to be added [15:04:57] Jeff_Green: maybe because jenkins bot has not verified it yet? [15:05:06] (03CR) 10Muehlenhoff: [V: 032 C: 032] Remove gehel from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/337850 (owner: 10Muehlenhoff) [15:05:06] Jeff_Green: no C+2, no V+2 [15:05:14] (03PS2) 10Muehlenhoff: Remove gehel from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/337850 [15:05:30] did jenkins croak? [15:05:49] 06Operations, 06Labs: Reimage labstore1001 and labstore1002 for DRBD storage setup - https://phabricator.wikimedia.org/T158196#3029409 (10chasemp) [15:05:50] you can add it manually, I had to do it on my last 4 patches [15:05:55] (03CR) 10Muehlenhoff: [V: 032 C: 032] Remove gehel from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/337850 (owner: 10Muehlenhoff) [15:05:58] ah, wacky [15:06:14] just add jenkins-bot right? 
[15:06:22] 06Operations, 10ops-eqiad, 06DC-Ops, 06Labs, and 2 others: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#3029424 (10chasemp) 05Open>03Resolved closed in favor of T158196 [15:06:24] that is what I do [15:06:51] 06Operations, 10ops-eqiad, 06DC-Ops: elastic1051 not booting from PXE - https://phabricator.wikimedia.org/T158197#3029430 (10Gehel) [15:07:55] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#3029452 (10chasemp) [15:07:57] 06Operations, 07Wikimedia-Incident: Add step in start-nfs to ask operator to consider dropping some snapshots - https://phabricator.wikimedia.org/T121890#3029448 (10chasemp) 05Open>03declined closing in favor of T158196 [15:08:01] 06Operations, 07Wikimedia-Incident: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#3029455 (10chasemp) 05Open>03declined closing in favor of T158196 [15:08:04] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004225 (10chasemp) [15:08:13] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004225 (10chasemp) [15:08:16] 06Operations, 06Labs, 07Wikimedia-Incident: Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#3029461 (10chasemp) 05Open>03declined closing in favor of T158196 [15:08:17] ok. so now in theory jenkins will see this and review it right? 
[15:08:22] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004226 (10chasemp) [15:09:56] Jeff_Green, maybe, you have to pray to the god of "continuous integration and virtual machines" [15:10:25] * Jeff_Green goes looking for a goat to sacrifice.... [15:11:26] Jeff_Green: yesterday it took me around 10 minutes to get the change through after adding the bot :-) [15:11:26] and rebase if it doesn't respond, sometimes it gets "stuck", when someone sends 1 million jobs at the same time [15:11:57] ok, good to know [15:12:12] you can peek at https://integration.wikimedia.org/zuul/ too to see queue status [15:13:00] yeah, it is a bit overloaded right now [15:13:08] Zuul gate pipeline looks odd yeah [15:15:50] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3029491 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1050.eqiad.wmnet'] ``` and were **ALL** successful. [15:20:43] (03PS1) 10Jcrespo: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337860 [15:20:45] (03PS4) 10Ottomata: Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [15:21:03] (03CR) 10Jcrespo: [C: 04-2] "Wait until buffer pool is hot." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337860 (owner: 10Jcrespo) [15:22:53] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:24:11] (03PS5) 10Ottomata: Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [15:24:48] (03PS4) 10Jcrespo: mariadb: Move grants and mysqld config files to the role [puppet] - 10https://gerrit.wikimedia.org/r/337814 [15:26:25] (03PS6) 10Ottomata: Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [15:26:48] hashar: ^ I think jenkins/zuul are in a bit of trouble [15:27:16] what is happening? [15:28:26] bah hours to get changes in [15:28:27] looking [15:28:40] (03PS7) 10Ottomata: Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [15:30:04] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:30:24] PROBLEM - Elasticsearch HTTPS on elastic1050 is CRITICAL: SSL CRITICAL - failed to verify search.svc.eqiad.wmnet against elastic1050.eqiad.wmnet [15:30:30] (03PS1) 10ArielGlenn: tiny util to get last revision id from bz2 xml content dump file [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/337863 [15:31:04] PROBLEM - Elasticsearch HTTPS on elastic1049 is CRITICAL: SSL CRITICAL - failed to verify search.svc.eqiad.wmnet against elastic1049.eqiad.wmnet [15:32:01] (03PS8) 10Ottomata: Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [15:32:29] the jobs in gate-and-submit keep being cancelled for some reason [15:33:48] there are some jobs that failed [15:33:51] (03CR) 10Faidon Liambotis: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) (owner: 10Ottomata) [15:33:53] so that cancels all the changes behind [15:37:48] !log stopping slave and repartitioning db1045 [15:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:56] jynus: asking out of caution, I wanted to merge https://gerrit.wikimedia.org/r/#/c/337568/ in mediawiki-config, no problem in doing so? [15:42:55] (03PS5) 10Giuseppe Lavagetto: Bump up number of queue runners for transcodes [puppet] - 10https://gerrit.wikimedia.org/r/337230 (https://phabricator.wikimedia.org/T108234) (owner: 10Brion VIBBER) [15:43:30] \o/ :D [15:43:39] godog, why asking me, what is dangerous aside from the obvious? [15:44:04] or you just mean as a mediawiki deployment, not db-related?
[15:44:41] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Bump up number of queue runners for transcodes [puppet] - 10https://gerrit.wikimedia.org/r/337230 (https://phabricator.wikimedia.org/T108234) (owner: 10Brion VIBBER) [15:44:45] jynus: I saw you were merging mediawiki-config changes earlier plus the db pages [15:44:55] !log Zuul reducing gate-and-submit minimum amount of changes to process from the wrong 12 down to 2. In case of repeating failures it would end up running jobs for only two jobs which would prevent cancelling jobs for up to 11 changes! [15:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:07] ah, no [15:45:08] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtest - https://phabricator.wikimedia.org/T154706#3029607 (10chasemp) [15:45:10] just a depool [15:45:16] Jeff_Green: chasemp faulty configuration of Zuul that has been around for years :( fix is https://gerrit.wikimedia.org/r/337865 zuul: fix window-floor which is in changes not jobs [15:45:19] not something we normally do [15:45:24] just go on [15:45:38] (03CR) 10Filippo Giunchedi: [C: 032] Switch xenon redis to mwlog1001.eqiad.wmnet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337568 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [15:45:58] we will repool much later, buffer pool takes a lot to reheat [15:46:53] hashar: patch merged on wmf11, should I deploy now or is it safe to wait?
[15:47:16] dcausse: looks good to do it now [15:47:27] hashar: ok [15:48:13] (03PS2) 10Filippo Giunchedi: Switch xenon redis to mwlog1001.eqiad.wmnet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337568 (https://phabricator.wikimedia.org/T123728) [15:48:15] (03PS2) 10Filippo Giunchedi: Switch udp2log destination to mwlog1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337560 (https://phabricator.wikimedia.org/T123728) [15:48:17] dcausse: sorry looks like the CI gate had some troubles :/ [15:48:20] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtest - https://phabricator.wikimedia.org/T154706#3029622 (10chasemp) 05stalled>03Open [15:48:30] 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#3029623 (10chasemp) 05stalled>03Open [15:48:41] hashar: np [15:48:50] 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10chasemp) >>! In T154664#2919959, @RobH wrote: > I've chatted with Chase about this via IRC. The one host to work as a labvirt/nova/neutron host will be for small VM testing,... [15:48:55] godog: you're about to scap something right? 
[15:49:12] dcausse: correct [15:49:17] (03CR) 10Muehlenhoff: Add account validation script / cron job (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [15:49:29] (03PS4) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [15:49:49] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Switch xenon redis to mwlog1001.eqiad.wmnet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337568 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [15:49:58] godog: ok I'll wait, but you may see 2 patches after git fetch I suppose [15:50:01] (03CR) 10jenkins-bot: Switch xenon redis to mwlog1001.eqiad.wmnet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337568 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [15:50:05] (03CR) 10Marostegui: [C: 031] "If you can, close the task once it is repooled. If it happens again we have it for reference." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/337860 (owner: 10Jcrespo) [15:50:54] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:51:38] (03PS3) 10Ottomata: Set hue allowed_hosts=* to work around bug http://community.cloudera.com/t5/Web-UI-Hue-Beeswax/New-Cloudera-installation-Hue-Bad-Request-400/td-p/50344/page/5 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336906 (https://phabricator.wikimedia.org/T152714) [15:51:42] !log filippo@tin Synchronized wmf-config/StartProfiler.php: Switch xenon redis to mwlog1001.eqiad.wmnet (duration: 00m 42s) [15:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:02] dcausse: ack, no I didn't see your changes though, all yours [15:52:04] (03CR) 10Ottomata: [V: 032 C: 032] Set hue allowed_hosts=* to work around bug http://community.cloudera.com/t5/Web-UI-Hue-Beeswax/New-Cloudera-installation-Hue-Bad-Request-400 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336906 (https://phabricator.wikimedia.org/T152714) (owner: 10Ottomata) [15:52:13] godog: ok, thanks [15:54:21] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3029672 (10chasemp) [15:54:40] (03PS1) 10Ottomata: Set timeouts on various hdfs puppet execs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/337866 (https://phabricator.wikimedia.org/T130832) [15:56:22] (03PS6) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) [15:56:50] (03PS2) 10Ottomata: Set timeouts on various hdfs puppet execs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/337866 (https://phabricator.wikimedia.org/T130832) [15:57:21] (03CR) 10Ottomata: [V: 032 C: 032] Set timeouts on various hdfs puppet execs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/337866 
(https://phabricator.wikimedia.org/T130832) (owner: 10Ottomata) [15:57:23] !log (Old action but for the sake of getting it logged) Force RAID controller to work on WriteBack even with the broken BBU it has now on db1060 so it can keep up with the replication thread - T158194 [15:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:27] T158194: Replace BBU for db1060 - https://phabricator.wikimedia.org/T158194 [15:57:36] (03CR) 10Volans: "Thanks a lot _joe_ for the review!" (0341 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:59:23] gone for several hours, hope to be back for archcomm [16:00:08] (03PS1) 10Ottomata: Update cdh module with hue fix and timeouts on hdfs execs [puppet] - 10https://gerrit.wikimedia.org/r/337867 [16:00:45] (03CR) 10Ottomata: [V: 032 C: 032] Update cdh module with hue fix and timeouts on hdfs execs [puppet] - 10https://gerrit.wikimedia.org/r/337867 (owner: 10Ottomata) [16:01:47] !log dcausse@tin Synchronized php-1.29.0-wmf.11/extensions/CirrusSearch/: T156234: Fold some problematic whitespaces with completion (duration: 01m 01s) [16:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:52] T156234: ICU folding seems to cause issues with completion - https://phabricator.wikimedia.org/T156234 [16:05:12] (03PS5) 10Ema: varnish: icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337808 (https://phabricator.wikimedia.org/T145661) [16:06:11] (03PS5) 10Jcrespo: mariadb: Move grants and mysqld config files to the role [puppet] - 10https://gerrit.wikimedia.org/r/337814 [16:06:29] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Move grants and mysqld config files to the role [puppet] - 10https://gerrit.wikimedia.org/r/337814 (owner: 10Jcrespo) [16:08:05] (03PS1) 10Jcrespo: Revert "mariadb: Move grants and mysqld config files to the role" [puppet] - 
10https://gerrit.wikimedia.org/r/337868 [16:08:13] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "mariadb: Move grants and mysqld config files to the role" [puppet] - 10https://gerrit.wikimedia.org/r/337868 (owner: 10Jcrespo) [16:09:42] (03PS1) 10Jcrespo: Revert "Revert "mariadb: Move grants and mysqld config files to the role"" [puppet] - 10https://gerrit.wikimedia.org/r/337869 [16:10:02] (03CR) 10Jcrespo: "The patch is ok, but there is something wrong about the submodule update." [puppet] - 10https://gerrit.wikimedia.org/r/337869 (owner: 10Jcrespo) [16:10:34] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:11:34] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:13:50] (03PS9) 10Ottomata: Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [16:16:35] !log T155120: restarting Cassandra on restbase1007-a to enable Prometheus exporter (canary) [16:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:41] T155120: Enable Prometheus metrics export for Cassandra - https://phabricator.wikimedia.org/T155120 [16:17:04] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3029770 (10chasemp) [16:17:26] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3029771 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1051.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1051.eqi... 
[16:17:42] (03PS2) 10Jcrespo: Revert "Revert "mariadb: Move grants and mysqld config files to the role"" [puppet] - 10https://gerrit.wikimedia.org/r/337869 [16:18:00] (03PS7) 10Filippo Giunchedi: performance: switch xenon apache backend to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/337569 (https://phabricator.wikimedia.org/T123728) [16:21:58] (03CR) 10Jgreen: [C: 032] rename fundraising db1008 to frav1001 and change its IP [dns] - 10https://gerrit.wikimedia.org/r/337851 (owner: 10Jgreen) [16:22:11] 06Operations, 13Patch-For-Review: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#3029812 (10MoritzMuehlenhoff) [16:22:13] 06Operations: /etc/localtime should be a symbolic link - https://phabricator.wikimedia.org/T157795#3029810 (10MoritzMuehlenhoff) 05Open>03Resolved I've looked into this and this; it is harmless and the default behaviour in jessie. tzdata was changed to create a symlink on new installations (https://bugs.debi... [16:22:34] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] "CI backed up" [puppet] - 10https://gerrit.wikimedia.org/r/337569 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [16:23:59] !log authdns-update to deploy fundraising host rename db1008->frav1001 [16:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:02] (03PS3) 10Jcrespo: Revert "Revert "mariadb: Move grants and mysqld config files to the role"" [puppet] - 10https://gerrit.wikimedia.org/r/337869 [16:24:52] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 07User-notice: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3029816 (10Marostegui) For the backup data: es1017 looks like a good candidate: ``` marostegui@es1017:~$ df -hT /srv Filesystem Type Size... 
[16:25:36] (03CR) 10BBlack: [C: 031] varnish: icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337808 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [16:26:14] (03CR) 10Faidon Liambotis: "That would add it to all boots, not just the initial boot. A delay may be alleviating your issues but do you have any sense of what's actu" [puppet] - 10https://gerrit.wikimedia.org/r/337804 (https://phabricator.wikimedia.org/T149845) (owner: 10Gehel) [16:28:47] (03CR) 10Jcrespo: [C: 032] "Looks good: https://puppet-compiler.wmflabs.org/5478/" [puppet] - 10https://gerrit.wikimedia.org/r/337869 (owner: 10Jcrespo) [16:32:59] (03PS4) 10Jcrespo: Revert "Revert "mariadb: Move grants and mysqld config files to the role"" [puppet] - 10https://gerrit.wikimedia.org/r/337869 [16:35:28] (03CR) 10Jcrespo: [C: 032] Revert "Revert "mariadb: Move grants and mysqld config files to the role"" [puppet] - 10https://gerrit.wikimedia.org/r/337869 (owner: 10Jcrespo) [16:36:49] _joe_: would you mind merging the systemd::syslog change I made to allow to change the rsyslog matcher ? 
https://gerrit.wikimedia.org/r/#/c/337411/3 :) [16:37:48] (03PS2) 10Jcrespo: mariadb-sanitarium: move custom init.d under the mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/337835 [16:39:02] !log flip xenon redis and apache from fluorine to mwlog1001 - T123728 [16:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:07] T123728: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728 [16:39:22] (03PS6) 10Ema: varnish: icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337808 (https://phabricator.wikimedia.org/T145661) [16:39:29] (03CR) 10Ema: [V: 032 C: 032] varnish: icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337808 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [16:40:18] (03PS3) 10Jcrespo: mariadb-sanitarium: move custom init.d under the mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/337835 [16:42:09] (03CR) 10Jcrespo: [C: 032] mariadb-sanitarium: move custom init.d under the mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/337835 (owner: 10Jcrespo) [16:43:37] (03PS3) 10Jcrespo: phabricator database: Move templates to the role [puppet] - 10https://gerrit.wikimedia.org/r/337827 [16:43:44] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:46:02] ^fixing [16:47:02] 06Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#2766226 (10MoritzMuehlenhoff) Also happened on a range of mw servers when reimaging them to jessie (https://phabricator.wikimedia.org/T144911). Maybe let one of the local DC ops look at the initial system... 
[16:49:28] (03PS1) 10Ema: varnish: fix path to icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337872 [16:50:41] (03PS1) 10Jcrespo: mariadb: Fix typo on parsercache module after reorganization [puppet] - 10https://gerrit.wikimedia.org/r/337873 [16:50:44] PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:52:34] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:53:34] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:54:10] (03CR) 10Ema: [V: 032 C: 032] varnish: fix path to icinga check for expiry mailbox lag [puppet] - 10https://gerrit.wikimedia.org/r/337872 (owner: 10Ema) [16:54:14] PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:54:44] 06Operations, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3029891 (10thcipriani) >>! In T125735#3004660, @elukey wrote: > As you mentioned before, maybe 200ms of timeout for a Jobrunn... [16:56:01] (03PS2) 10Jcrespo: mariadb: Fix typo on parsercache && labs after template move [puppet] - 10https://gerrit.wikimedia.org/r/337873 [16:56:34] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 49 seconds ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_varnish_expiry_mailbox_lag] [16:56:51] (03PS3) 10Jcrespo: mariadb: Fix typo on parsercache && labs after template move [puppet] - 10https://gerrit.wikimedia.org/r/337873 [16:57:01] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Fix typo on parsercache && labs after template move [puppet] - 10https://gerrit.wikimedia.org/r/337873 (owner: 10Jcrespo) [16:57:31] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:58:45] all right, usage is up on video scalers! looks good i think, though i can't ssh in on the network i'm on atm to check details :D [16:58:51] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:31] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:59:51] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 34 seconds ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_varnish_expiry_mailbox_lag] [16:59:51] PROBLEM - Check Varnish expiry mailbox lag on cp2021 is CRITICAL: NRPE: Command check_check_varnish_expiry_mailbox_lag not defined [17:00:34] looking ^ [17:00:41] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:00:51] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:01:01] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:01:41] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:01:51] PROBLEM - Check Varnish expiry mailbox lag on cp3034 is CRITICAL: NRPE: Command check_check_varnish_expiry_mailbox_lag not defined [17:01:51] RECOVERY - Check Varnish expiry mailbox lag on cp2021 is OK: OK: expiry mailbox lag is 0 [17:02:12] RECOVERY - puppet last run on pc2006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:02:31] PROBLEM - Check Varnish expiry mailbox lag on cp2007 is CRITICAL: NRPE: Command check_check_varnish_expiry_mailbox_lag not defined [17:03:02] the varnish expiry errors should be fixed soon, sorry for the noise [17:03:11] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 59 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_varnish_expiry_mailbox_lag] [17:03:21] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_varnish_expiry_mailbox_lag] [17:04:11] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:04:21] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:04:31] RECOVERY - Check Varnish expiry mailbox lag on cp2007 is OK: OK: expiry mailbox lag is 0 [17:04:47] since we didn't get wmf.12 to group0 yesterday, and the blocking error got a patch/backport, I'm going to go ahead and push wmf.12 to group0 (after pushing the patch) [17:04:51] RECOVERY - Check Varnish expiry mailbox lag on cp3034 is OK: OK: expiry mailbox lag is 0 [17:05:01] !log starting wmf.12 to group0 [17:05:01] PROBLEM - Check systemd state on db2062 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:11] PROBLEM - Check whether ferm is active by checking the default input chain on db2062 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:05:41] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:05:55] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3029923 (10RobH) 05Open>03Resolved Thanks! I just wanted to make sure (and having it on a task now for reference makes future triage of new accounts easier.... 
[17:08:09] !log thcipriani@tin Synchronized php-1.29.0-wmf.12/includes/libs/rdbms/ChronologyProtector.php: [[gerrit:337848|Make ChronologyProtector::init() use instanceof instead of empty()]] T158127 (duration: 00m 43s) [17:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:13] T158127: Catchable fatal error: Object of class __PHP_Incomplete_Class could not be converted to string in /srv/mediawiki/php-1.29.0-wmf.12/includes/libs/rdbms/ChronologyProtector.php on line 124 - https://phabricator.wikimedia.org/T158127 [17:08:59] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3029929 (10mobrovac) >>! In T146664#3027004, @Halfak wrote: > @mobrovac, let me try again. Who from #operations did you... [17:09:53] (03PS10) 10Ottomata: Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [17:10:18] (03PS1) 10Thcipriani: Revert "Revert "Group0 to 1.29.0-wmf.12"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337876 (https://phabricator.wikimedia.org/T155527) [17:10:43] (03CR) 10Thcipriani: [C: 032] Revert "Revert "Group0 to 1.29.0-wmf.12"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337876 (https://phabricator.wikimedia.org/T155527) (owner: 10Thcipriani) [17:12:41] (03CR) 10Ottomata: [C: 032] Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) (owner: 10Ottomata) [17:12:43] (03Merged) 10jenkins-bot: Revert "Revert "Group0 to 1.29.0-wmf.12"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337876 (https://phabricator.wikimedia.org/T155527) (owner: 10Thcipriani) [17:12:53] (03CR) 10jenkins-bot: Revert 
"Revert "Group0 to 1.29.0-wmf.12"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337876 (https://phabricator.wikimedia.org/T155527) (owner: 10Thcipriani) [17:14:21] RECOVERY - Elasticsearch HTTPS on elastic1050 is OK: SSL OK - Certificate elastic1050.eqiad.wmnet valid until 2022-02-14 17:12:41 +0000 (expires in 1824 days) [17:16:01] RECOVERY - Elasticsearch HTTPS on elastic1049 is OK: SSL OK - Certificate elastic1049.eqiad.wmnet valid until 2022-02-14 17:14:47 +0000 (expires in 1824 days) [17:16:11] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational [17:17:30] thcipriani: so still the same ? :( [17:17:35] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: Group0 to 1.29.0-wmf.12 T155527 [17:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:39] T155527: MW-1.29.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T155527 [17:18:01] RECOVERY - Elasticsearch HTTPS on elastic1052 is OK: SSL OK - Certificate elastic1052.eqiad.wmnet valid until 2022-02-14 17:17:07 +0000 (expires in 1824 days) [17:18:31] (03CR) 10BryanDavis: "Untested, but it broadly looks good. Test it and make it so. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/337787 (owner: 10Madhuvishy) [17:19:06] (03PS3) 10Yuvipanda: tools: Make DNS point to labsdb1004 and not 1005 [puppet] - 10https://gerrit.wikimedia.org/r/337775 (https://phabricator.wikimedia.org/T123731) [17:19:10] (03PS1) 10Ottomata: Configure analytics cluster nodes to use thirdparty/cloudera apt component [puppet] - 10https://gerrit.wikimedia.org/r/337877 (https://phabricator.wikimedia.org/T155726) [17:19:31] hashar: I no longer see the error, looks like AaronSchul.z patch worked \o/ [17:19:44] \O/ [17:21:14] guess that prepare for group1 later today [17:21:40] (03PS3) 10Madhuvishy: tools: Read list of tools for precise email reminder from precise-tools dashboard [puppet] - 10https://gerrit.wikimedia.org/r/337787 [17:21:50] (03CR) 10Madhuvishy: [V: 032 C: 032] tools: Read list of tools for precise email reminder from precise-tools dashboard [puppet] - 10https://gerrit.wikimedia.org/r/337787 (owner: 10Madhuvishy) [17:23:52] 06Operations, 10ops-eqiad, 06DC-Ops: elastic1051 not booting from PXE - https://phabricator.wikimedia.org/T158197#3029967 (10Cmjohnson) 05Open>03Resolved @gehel, There was a vlan conflict. Fixed! [17:27:07] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3029987 (10Halfak) Great! But note that ORES is stateless unless you consider our cache to be "state". Surely "consens... 
[17:28:37] (03PS4) 10Yuvipanda: tools: Make DNS point to labsdb1004 and not 1005 [puppet] - 10https://gerrit.wikimedia.org/r/337775 (https://phabricator.wikimedia.org/T123731) [17:28:42] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Make DNS point to labsdb1004 and not 1005 [puppet] - 10https://gerrit.wikimedia.org/r/337775 (https://phabricator.wikimedia.org/T123731) (owner: 10Yuvipanda) [17:30:11] PROBLEM - NTP on db2062 is CRITICAL: NTP CRITICAL: Offset unknown [17:36:24] (03PS1) 10BryanDavis: Refactor apt-get actions in Dockerfiles [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337880 [17:36:42] 06Operations, 10ops-eqiad, 06DC-Ops: elastic1051 not booting from PXE - https://phabricator.wikimedia.org/T158197#3030019 (10Gehel) Thanks! [17:37:17] (03CR) 10BryanDavis: "Totally untested at this point. Hoping that Yuvi can help me design a test plan." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337880 (owner: 10BryanDavis) [17:38:59] (03CR) 10Yuvipanda: "Nice!" 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337880 (owner: 10BryanDavis) [17:39:33] (03PS1) 10Jcrespo: Upgrade toolsdb master to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/337881 (https://phabricator.wikimedia.org/T157358) [17:40:38] (03CR) 10Marostegui: [C: 031] Upgrade toolsdb master to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/337881 (https://phabricator.wikimedia.org/T157358) (owner: 10Jcrespo) [17:41:40] !log thcipriani@tin Synchronized php-1.29.0-wmf.11/includes/libs/rdbms/ChronologyProtector.php: [[gerrit:337878|Make ChronologyProtector::init() use instanceof instead of empty()]] T158127 (duration: 00m 41s) [17:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:45] T158127: Catchable fatal error: Object of class __PHP_Incomplete_Class could not be converted to string in /srv/mediawiki/php-1.29.0-wmf.12/includes/libs/rdbms/ChronologyProtector.php on line 124 - https://phabricator.wikimedia.org/T158127 [17:53:58] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [17:53:58] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:54:21] ah, interesting [17:54:29] I think I should just silence this [17:55:03] (03CR) 10Jcrespo: [C: 032] Upgrade toolsdb master to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/337881 (https://phabricator.wikimedia.org/T157358) (owner: 10Jcrespo) [17:55:55] (03CR) 10Gehel: [C: 04-1] "To be honest, I have no idea (and I am way out of my comfort zone here). 
It was suggested to me to add this and that there was reasons for" [puppet] - 10https://gerrit.wikimedia.org/r/337804 (https://phabricator.wikimedia.org/T149845) (owner: 10Gehel) [17:56:30] done [17:58:31] !log stopping labsdb1005 mariadb + puppet in preparation for reimage [17:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:52] 06Operations, 06Analytics-Kanban: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3030097 (10elukey) Summary of today: * I followed https://github.com/piwik/piwik/issues/6398#issuecomment-91093146 and set `bulk_requests_use_transaction=0` manually to fix an error showing up... [18:04:27] 06Operations, 06Analytics-Kanban, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3030103 (10elukey) p:05Triage>03Normal a:05Milimetric>03elukey [18:07:08] (03PS1) 10Ottomata: Fix heapsize alert conditionals so that they work in labs [puppet] - 10https://gerrit.wikimedia.org/r/337886 (https://phabricator.wikimedia.org/T88640) [18:11:14] (03CR) 10Rush: diamond: Allow providing puppet file reference to collector config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337769 (owner: 10Madhuvishy) [18:15:20] (03PS2) 10Ottomata: Fix heapsize alert conditionals so that they work in labs [puppet] - 10https://gerrit.wikimedia.org/r/337886 (https://phabricator.wikimedia.org/T88640) [18:20:03] PROBLEM - puppet last run on restbase-dev1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:21:00] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: upgrade backup4001 hard disk array - https://phabricator.wikimedia.org/T157473#3030219 (10Jgreen) [18:22:48] (03CR) 10jerkins-bot: [V: 04-1] Fix heapsize alert conditionals so that they work in labs [puppet] - 10https://gerrit.wikimedia.org/r/337886 (https://phabricator.wikimedia.org/T88640) (owner: 10Ottomata) [18:24:51] (03PS3) 10Ottomata: Fix heapsize alert conditionals so that they work in labs [puppet] - 10https://gerrit.wikimedia.org/r/337886 (https://phabricator.wikimedia.org/T88640) [18:27:28] (03PS4) 10Madhuvishy: diamond: Allow providing puppet file reference to collector config file [puppet] - 10https://gerrit.wikimedia.org/r/337769 [18:28:36] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3030240 (10Cmjohnson) the new disk is on-site please let me know when ready to swap out. [18:30:37] !log Stop MySQL and shutdown db2062 for maintenance - T156478 [18:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:42] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [18:38:53] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [18:38:57] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3030254 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1051.eqiad.wmnet'] ``` The... 
[18:39:03] RECOVERY - Check systemd state on db2062 is OK: OK - running: The system is fully operational [18:39:10] yay [18:39:13] RECOVERY - Check whether ferm is active by checking the default input chain on db2062 is OK: OK ferm input default policy is set [18:39:17] \o/ [18:43:01] (03PS2) 10Dzahn: add missing wikimania2005.m wikimania2006.m mobile names [dns] - 10https://gerrit.wikimedia.org/r/337522 (https://phabricator.wikimedia.org/T152882) [18:43:32] (03PS1) 10Chad: Add Dashiki to branch config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337890 [18:43:47] (03CR) 10Dzahn: [C: 032] add missing wikimania2005.m wikimania2006.m mobile names [dns] - 10https://gerrit.wikimedia.org/r/337522 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn) [18:47:07] (03CR) 10Dzahn: "@MaxSem that exception in varnish is for 2012-2015, kind of strange anyways. missing before and after" [dns] - 10https://gerrit.wikimedia.org/r/337522 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn) [18:47:23] (03PS4) 10Ottomata: Fix heapsize alert conditionals so that they work in labs [puppet] - 10https://gerrit.wikimedia.org/r/337886 (https://phabricator.wikimedia.org/T88640) [18:49:03] RECOVERY - puppet last run on restbase-dev1002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:53:31] (03CR) 10Chad: [C: 032] Add Dashiki to branch config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337890 (owner: 10Chad) [18:54:03] (03PS1) 10Dzahn: adjust wikimania regex for mobile hosts, cover 2002-2019 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) [18:54:33] 06Operations, 10DNS, 10Traffic, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#2862553 (10Dzahn) Added the 2 missing Wikimania ones. 
https://wikimania2005.m.wikimedia.org/wiki/Main_Page https://wikimania2006.m.wikimedia.org/wiki/Main_Page [18:54:36] 06Operations, 10DNS, 10Traffic, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#3030347 (10Dzahn) Added the 2 missing Wikimania ones. https://wikimania2005.m.wikimedia.org/wiki/Main_Page https://wikimania2006.m.wikimedia.org/wiki/Main_Page [18:55:43] 06Operations, 10DNS, 10Traffic, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#3030349 (10Dzahn) [18:56:01] (03PS5) 10Ottomata: Fix heapsize alert conditionals so that they work in labs [puppet] - 10https://gerrit.wikimedia.org/r/337886 (https://phabricator.wikimedia.org/T88640) [18:56:03] (03Merged) 10jenkins-bot: Add Dashiki to branch config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337890 (owner: 10Chad) [18:56:48] (03CR) 10Dzahn: "https://wikimania2005.m.wikimedia.org/wiki/Main_Page and https://wikimania2006.m.wikimedia.org/wiki/Main_Page work already. But this shoul" [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn) [18:56:53] (03CR) 10jenkins-bot: Add Dashiki to branch config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337890 (owner: 10Chad) [18:58:30] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2783735 (10cscott) I'll pitch {T90914} as a better solution to the "name for different thumbnail sizes", as it generalizes that requirement and lets wi... [18:59:45] !log demon@tin Synchronized multiversion/submodules.json: no-op (duration: 00m 50s) [18:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170215T1900). [19:00:13] RECOVERY - NTP on db2062 is OK: NTP OK: Offset 0.003641575575 secs [19:01:03] PROBLEM - puppet last run on francium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:30] 06Operations, 06Labs: Reimage labstore1001 and labstore1002 for DRBD storage setup - https://phabricator.wikimedia.org/T158196#3029409 (10greg) (those tasks above that this task was mentioned in were all(?) in `#wikimedia-incident` as a follow-up/action item, should this one be as well?) [19:04:31] 06Operations, 06Labs: Reimage labstore1001 and labstore1002 for DRBD storage setup - https://phabricator.wikimedia.org/T158196#3030391 (10chasemp) >>! In T158196#3030375, @greg wrote: > (those tasks above that this task was mentioned in were all(?) in `#wikimedia-incident` as a follow-up/action item, should th... [19:04:48] (03CR) 10Dzahn: [C: 032] jenkins: merge user/group sub classes [puppet] - 10https://gerrit.wikimedia.org/r/337287 (owner: 10Hashar) [19:05:18] (03CR) 10Ottomata: [C: 032] Fix heapsize alert conditionals so that they work in labs [puppet] - 10https://gerrit.wikimedia.org/r/337886 (https://phabricator.wikimedia.org/T88640) (owner: 10Ottomata) [19:05:25] (03CR) 10Dzahn: [C: 032] "please submit without the dependencies" [puppet] - 10https://gerrit.wikimedia.org/r/337287 (owner: 10Hashar) [19:06:07] (03PS5) 10Dzahn: jenkins: merge user/group sub classes [puppet] - 10https://gerrit.wikimedia.org/r/337287 (owner: 10Hashar) [19:06:39] (03PS2) 10BryanDavis: Refactor apt-get actions in Dockerfiles [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337880 [19:07:27] (03PS6) 10Dzahn: jenkins: merge user/group sub classes [puppet] - 10https://gerrit.wikimedia.org/r/337287 (owner: 10Hashar) [19:07:36] (03CR) 10Dzahn: [V: 032 C: 032] jenkins: merge user/group sub classes [puppet] - 
10https://gerrit.wikimedia.org/r/337287 (owner: 10Hashar) [19:14:25] (03CR) 10Dzahn: "Is there an example of another check where this is used and really works? I just see one with desc " description => 'Kafka Cluster analyt" [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [19:17:08] (03PS4) 10Dzahn: contint: /var/lib/jenkins/builds is no more [puppet] - 10https://gerrit.wikimedia.org/r/337809 (owner: 10Hashar) [19:18:28] (03PS5) 10Dzahn: contint: /var/lib/jenkins/builds is no more [puppet] - 10https://gerrit.wikimedia.org/r/337809 (owner: 10Hashar) [19:22:12] (03CR) 10Dzahn: [C: 032] contint: /var/lib/jenkins/builds is no more [puppet] - 10https://gerrit.wikimedia.org/r/337809 (owner: 10Hashar) [19:23:30] (03PS1) 10Dereckson: Enable Popups on se.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337897 [19:24:06] (03PS2) 10RobH: add Matthias Mullie to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/337055 [19:25:00] (03PS3) 10RobH: add Matthias Mullie to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/337055 [19:25:16] (03CR) 10Dzahn: "sudo rmdir /var/lib/jenkins/builds/ (empty directories) on contint1002 and contint2001" [puppet] - 10https://gerrit.wikimedia.org/r/337809 (owner: 10Hashar) [19:25:48] * Dereckson adds to SWAT 337987 - Enable Popups on se.wikimedia [19:26:25] wait [19:26:54] MaxSem: graduate a feature from beta to non beta for a chapter wiki, that qualifies for SWAT or that's a "new feature" too? 
[19:27:11] (03CR) 10Dzahn: "we have this https://wikitech.wikimedia.org/wiki/UID which was for reserving specific UIDs for system users across hosts" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [19:27:16] I think it's swattable [19:28:44] MaxSem: se.wikimedia uses Popups extension in beta for some months, and Reading has asserted it's ok to enable it by default, but that would be the first wiki to have it in non beta mode [19:29:04] (03PS7) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) [19:29:22] (03PS1) 10Yuvipanda: tools: Don't allow tools to set Service-Worker-Allowed [puppet] - 10https://gerrit.wikimedia.org/r/337898 (https://phabricator.wikimedia.org/T158216) [19:29:25] (03CR) 10RobH: [C: 032] add Matthias Mullie to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/337055 (owner: 10RobH) [19:29:56] RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:30:09] (03CR) 10Ottomata: "In icinga here:" [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [19:30:30] (03PS3) 10Yuvipanda: tools: Enable cronjobs for tools k8s [puppet] - 10https://gerrit.wikimedia.org/r/337776 (https://phabricator.wikimedia.org/T158155) [19:30:36] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Enable cronjobs for tools k8s [puppet] - 10https://gerrit.wikimedia.org/r/337776 (https://phabricator.wikimedia.org/T158155) (owner: 10Yuvipanda) [19:30:39] (03CR) 10Jcrespo: "Looks ok https://puppet-compiler.wmflabs.org/5485/" [puppet] - 10https://gerrit.wikimedia.org/r/337834 (owner: 10Jcrespo) [19:30:50] (03CR) 10Brian Wolff: [C: 031] tools: Don't allow tools to set Service-Worker-Allowed [puppet] - 10https://gerrit.wikimedia.org/r/337898 (https://phabricator.wikimedia.org/T158216) 
(owner: 10Yuvipanda) [19:30:53] (03PS3) 10Jcrespo: HAProxy: move templates under the role [puppet] - 10https://gerrit.wikimedia.org/r/337834 [19:30:54] I checked extension code, that's the same initialisation logic, the only thing changes is some hooks won't be executed to add beta information [19:31:08] (03PS2) 10Yuvipanda: tools: Don't allow tools to set Service-Worker-Allowed [puppet] - 10https://gerrit.wikimedia.org/r/337898 (https://phabricator.wikimedia.org/T158216) [19:31:14] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Don't allow tools to set Service-Worker-Allowed [puppet] - 10https://gerrit.wikimedia.org/r/337898 (https://phabricator.wikimedia.org/T158216) (owner: 10Yuvipanda) [19:31:17] (03CR) 10Jcrespo: [C: 032] HAProxy: move templates under the role [puppet] - 10https://gerrit.wikimedia.org/r/337834 (owner: 10Jcrespo) [19:31:34] (03PS4) 10Jcrespo: HAProxy: move templates under the role [puppet] - 10https://gerrit.wikimedia.org/r/337834 [19:31:41] (03CR) 10Volans: "In the last patch set there are the remaining things that I had to do." (036 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [19:31:47] (03CR) 10Dzahn: "/etc/login.defs says (on jessie)" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [19:32:31] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003/analytics-store for mlitn - https://phabricator.wikimedia.org/T157812#3030534 (10RobH) 05stalled>03Resolved No objections were noted, so Matthias now has access to the researchers group. I also just babysat the puppet... 
[19:32:41] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003/analytics-store for mlitn - https://phabricator.wikimedia.org/T157812#3030536 (10RobH) a:05RobH>03None [19:34:20] (03CR) 10Dzahn: [C: 031] "these were just general comments to answer the "is there a list" question. for the scope of this change and per Jaime's comments, +1" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [19:35:30] (03PS2) 10Dereckson: Enable Popups on se.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337897 (https://phabricator.wikimedia.org/T68374) [19:35:49] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337897 (https://phabricator.wikimedia.org/T68374) (owner: 10Dereckson) [19:35:51] (03CR) 10Marostegui: [C: 031] Resolve hanging mysql group with uid 1000 for new reimages [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [19:36:06] (03CR) 10Jcrespo: [C: 032] "Ok, but we could think about deploying something for stretch..." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [19:36:12] (03CR) 10Dzahn: [C: 031] "ah, thanks. 
+1 it is then" [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [19:37:27] (03Merged) 10jenkins-bot: Enable Popups on se.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337897 (https://phabricator.wikimedia.org/T68374) (owner: 10Dereckson) [19:37:36] (03CR) 10jenkins-bot: Enable Popups on se.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337897 (https://phabricator.wikimedia.org/T68374) (owner: 10Dereckson) [19:37:53] (03CR) 10Jcrespo: [V: 032 C: 032] HAProxy: move templates under the role [puppet] - 10https://gerrit.wikimedia.org/r/337834 (owner: 10Jcrespo) [19:38:12] Live on mwdebug1002 [19:39:05] (03PS3) 10Jcrespo: Resolve hanging mysql group with uid 1000 for new reimages [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) [19:42:16] PROBLEM - puppet last run on wtp1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:42:50] (03PS3) 10Dzahn: zuul: monitor Gearman queue growing out of control [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [19:43:30] 06Operations, 10ops-eqiad: Decommission old asw-c2-eqiad - https://phabricator.wikimedia.org/T156398#3030634 (10RobH) a:05RobH>03Cmjohnson >>! In T156398#3021778, @Cmjohnson wrote: > @robh are either of these under our service contract with juniper? > > The spare that failed stuck in loading (the spare)... 
[19:43:42] (03PS4) 10Jcrespo: Resolve hanging mysql group with uid 1000 for new reimages [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) [19:44:18] Yes, works on https://se.wikimedia.org/wiki/Transparens (I had to test some pages before getting a working one) [19:44:24] (even with ?debug=true) [19:47:16] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3695535 keys, up 107 days 11 hours - replication_delay is 626 [19:47:46] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3695380 keys, up 107 days 11 hours - replication_delay is 653 [19:48:03] (03CR) 10Jcrespo: [C: 032] "Tested: https://puppet-compiler.wmflabs.org/5490/" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [19:49:06] (03CR) 10Dzahn: [C: 032] zuul: monitor Gearman queue growing out of control [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [19:49:16] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3683156 keys, up 107 days 11 hours - replication_delay is 0 [19:49:52] (03PS5) 10Dzahn: jenkins: logrotate all log files [puppet] - 10https://gerrit.wikimedia.org/r/337383 (owner: 10Hashar) [19:50:39] mutante: guten tag :) [19:50:46] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3682639 keys, up 107 days 11 hours - replication_delay is 53 [19:50:49] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Popups by default on se.wikimedia (T68374) (duration: 00m 41s) [19:50:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:54] T68374: Enable Hovercards on se.wikimedia.org (Swedish chapter wiki) - https://phabricator.wikimedia.org/T68374 [19:50:56] ottomata: mutante: thanks for the Zuul Gearman icinga check. Not sure how spammy it will end up though [19:50:58] t [19:51:10] yup [19:51:41] hashar: it's not paging, good enough :) "one way to find out" and adjust it [19:52:02] (03PS1) 10Marostegui: linux-host-entries: Remove precise from labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/337911 (https://phabricator.wikimedia.org/T157358) [19:52:08] yeah on some other task Chase mentioned we should probably page for some of those alarms [19:52:24] (03PS1) 10Jcrespo: Update mariadb module to deploy mysql group changes for stretch [puppet] - 10https://gerrit.wikimedia.org/r/337912 (https://phabricator.wikimedia.org/T100501) [19:52:25] yea, but that should be done after some adjusting of the values [19:52:27] though typically with our timezone we have a rather large coverage [19:52:28] once it's stable [19:52:32] yup [19:52:41] (03CR) 10Jcrespo: [C: 031] linux-host-entries: Remove precise from labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/337911 (https://phabricator.wikimedia.org/T157358) (owner: 10Marostegui) [19:52:45] we should watch the alert history after a while [19:52:50] I have set more or less random values. I think it is conservative enough and will not alarm too much [19:53:02] ok [19:53:21] and it supposedly should point to the associated Grafana board. I didn't know we could add a URL to the description [19:53:45] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen :D [19:53:54] red : jobs waiting, green : jobs running [19:54:19] we will see it on IRC and we can silence and adjust it, no problem [19:54:48] I should migrate the grafana boards to puppet. 
Then we could have the same thresholds in the graph AND in the icinga check [19:54:56] just wanted to avoid checks that say "unknown - not enough data points" and similar things [19:55:13] they are not actionable. but this is [19:56:21] I also added you as a reviewer to a few jenkins patches some are rather trivial [19:56:41] gotcha is that there is like 15 of them all in a long dependent chain, but the first trivial ones are really independent [19:57:05] hashar: already seen, commented on dependencies, rebased and merged :) [19:57:05] but oh wait. You already merged some! [19:57:25] ultimate goal is to harness Jenkins with systemd :] [19:57:33] and maybe later on add firejail to it [19:57:38] yep, nice! [19:58:51] hashar: i was now looking at the logrotate [19:59:15] https://wiki.jenkins-ci.org/display/JENKINS/Access+Logging isn't much .. yea [19:59:40] I had disabled at some point but I don't quite remember why [19:59:57] I think I was trying to SIGHUP the process via logrotate, but Jenkins does not handle signals and thus quit [20:00:03] just like SIGTERM or SIGKILL maybe [20:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170215T2000). [20:00:25] sssh jouncebot, I'm here now. [20:00:33] o/ [20:00:54] mutante: yeah the old logrotate script did a SIGALRM but that does not quite work [20:02:09] mutante: so instead I reuse the same logrotate parameters as the jenkins.log file which happens to be what the jenkins.deb provides [20:03:06] hashar: i see. ok. let's see.. i do see rotated logs in /var/log/jenkins/ though [20:03:11] on contint1001 [20:03:21] not the access log though.. 
yep [20:03:27] yes, all makes sense [20:04:04] (03CR) 10Dzahn: [C: 032] jenkins: logrotate all log files [puppet] - 10https://gerrit.wikimedia.org/r/337383 (owner: 10Hashar) [20:05:13] (03CR) 10Jcrespo: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/5491/" [puppet] - 10https://gerrit.wikimedia.org/r/337912 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [20:05:22] (03PS2) 10Jcrespo: Update mariadb module to deploy mysql group changes for stretch [puppet] - 10https://gerrit.wikimedia.org/r/337912 (https://phabricator.wikimedia.org/T100501) [20:05:34] (03CR) 10Jcrespo: [V: 032 C: 032] Update mariadb module to deploy mysql group changes for stretch [puppet] - 10https://gerrit.wikimedia.org/r/337912 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [20:06:09] mutante: yeah I guess I commented out the access.log logrotate since that killed jenkins, then forgot to find a proper solution [20:06:22] will monitor tomorrow morning and friday morning and verify they rotated properly [20:06:38] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#732861 (10Jdlrobson) It's not merged yet but will be soon. See T156800. We plan to deploy within 2 weeks to beta features: T15... [20:06:42] !log contint1001 - logrotate --force /etc/logrotate.d/jenkins to test gerrit:337383 [20:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:49] oh [20:06:51] even better [20:07:07] hashar: it rotated the access.log to access.log.1 [20:07:12] \o/ [20:07:16] :) [20:09:05] 06Operations, 10DBA, 13Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501#3030736 (10jcrespo) The user part should be fixed, or fixed when all trusties are decommissioned. The group part will take effect starting on stretch. This is mostly done... 
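(Aside on the log rotation exchange above: the chat does not say which logrotate parameters the jenkins.deb ships, so what follows is only a sketch of one signal-free approach, copy-then-truncate, which logrotate calls copytruncate. Treat that choice as an assumption, not the confirmed Jenkins setup. The function name rotate_copytruncate and the sample log lines are hypothetical. The point it illustrates is why a daemon that dies on SIGHUP or SIGALRM can still have its access.log rotated to access.log.1 without being signalled at all.)

```python
import os
import shutil
import tempfile


def rotate_copytruncate(path):
    """Copy `path` to `path + '.1'`, then empty `path` in place.

    The original file keeps its inode, so a process that already has it
    open in append mode keeps writing to it and needs no signal.
    """
    rotated = path + ".1"
    shutil.copy2(path, rotated)
    with open(path, "r+b") as fh:
        fh.truncate(0)
    return rotated


def demo():
    """Simulate a long-running writer across one rotation."""
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "access.log")
        # Stand-in for the daemon: it opens the log once and never reopens it.
        writer = open(log, "a")
        writer.write("GET /job/1\n")
        writer.flush()

        rotated = rotate_copytruncate(log)

        # Append mode always writes at the current end of file,
        # which is offset 0 right after the truncate.
        writer.write("GET /job/2\n")
        writer.flush()
        writer.close()

        with open(rotated) as fh:
            old = fh.read()
        with open(log) as fh:
            new = fh.read()
        return old, new


if __name__ == "__main__":
    old, new = demo()
    print("access.log.1:", repr(old))
    print("access.log:  ", repr(new))
```

(One known trade-off of this approach: anything the daemon writes between the copy and the truncate is lost, which may be acceptable for an access log that is mostly used for debugging.)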
[20:09:48] (03CR) 10Jcrespo: [V: 032 C: 032] linux-host-entries: Remove precise from labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/337911 (https://phabricator.wikimedia.org/T157358) (owner: 10Marostegui) [20:10:02] (03PS2) 10Jcrespo: linux-host-entries: Remove precise from labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/337911 (https://phabricator.wikimedia.org/T157358) (owner: 10Marostegui) [20:10:08] (03CR) 10Jcrespo: [V: 032 C: 032] linux-host-entries: Remove precise from labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/337911 (https://phabricator.wikimedia.org/T157358) (owner: 10Marostegui) [20:10:17] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:10:22] hashar: you still want the one to disable logging though? [20:10:39] jenkins: allow access log to be flipped [20:10:40] yeah [20:10:45] https://gerrit.wikimedia.org/r/#/c/337385/ [20:10:46] yea, that one [20:10:51] I might end up disabling it by default [20:11:03] ok [20:11:03] and then just have them flipped on when we need them. Typically for debugging [20:11:40] last time I used them I think it was when we have setup Nodepool 1 year+ ago [20:12:06] Chad had a comment about it though on https://gerrit.wikimedia.org/r/#/c/337385/2/modules/jenkins/manifests/init.pp [20:12:10] sounds good to avoid unneccesary access logs, yep [20:12:46] yea, hmm. i dont know the "validate_bool" thing yet [20:12:57] not sure how validate_bool() is helpful. I guess it is plain paranoia from me [20:13:34] guess I can just drop it [20:16:09] bah it conflicts [20:16:34] mutante: wanna sprint the few others or is that time for lunch ? 
[20:17:03] i'd like to look at more patches later and not sprint [20:17:12] in other news, the icinga check has been added righ now [20:17:18] great [20:17:22] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, and 2 others: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3002516 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['labsdb1005.eqiad.wmnet'] ``` The lo... [20:17:47] and... it is UNKNOWN :P [20:17:50] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:17:58] glblblm [20:18:01] that was my concern [20:18:04] when i saw "check graphite" [20:18:19] "Service is critically flapping: 10 data below and 16 above the confidence bounds" [20:18:31] also, if it's flapping, it should not call it UNKNOWN [20:18:32] that is the first time I see an unknown state / purple one [20:18:35] that's a different status [20:18:53] i guess it does not exit with the proper exit code [20:19:00] i saw them a lot, common with the graphite checks [20:19:30] but i guess we removed the ones using check_graphite [20:19:42] as opposed to check_graphite_anomaly [20:20:12] maybe this will straighten out after a little time [20:20:33] let it gather some values and see if it starts working [20:20:40] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 7 data above and 20 below the confidence bounds [20:20:54] well now it's not UNKNOWN anymore [20:21:40] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint2001 is CRITICAL: CRITICAL: Anomaly detected: 6 data above and 22 below the confidence bounds [20:22:08] (03CR) 10Dzahn: "< icinga-wm> PROBLEM - Work requests waiting in Zuul Gearman server 
https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint2001" [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [20:23:17] checking [20:29:55] here are 9 other UNKNOWNS that are graphite checks and cant find data points https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=8&serviceprops=1048576 [20:29:59] made one with confidence bands at https://grafana-admin.wikimedia.org/dashboard/db/zuul-gearman?panelId=20&fullscreen [20:31:49] hashar: o/ not sure if this is an ops thing but I didn't get any response in -analytics - https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=All doesn't seem to have `ExternalLinksChange` as a schema [20:32:30] guess they renamed it [20:32:42] or it hasnt been deployed yet [20:32:51] hashar: is there actually an absolute number of "jobs in queue" that are reason to alert? [20:33:14] as opposed to variable limits based on the data in the past [20:33:16] yeah more or less [20:33:23] if you look at the top left graph on https://grafana.wikimedia.org/dashboard/db/zuul-gearman [20:33:29] that shows in red the jobs that are waiting [20:33:32] isn't it more reliable and simple then to alert on a fixed number? [20:33:48] we had some oddity today between 15:00 and 16:00 which reflect has a huge red mountain [20:33:57] the check anomaly is supposed to report that [20:34:22] so let's say there is always a really high number of jobs [20:34:31] that would be considered normal [20:34:40] because it's not an anomaly [20:35:29] ./check_graphite -U https://graphite.wikimedia.org check_anomaly --check_window 30 -W 5 -C 10 zuul.geard.queue.waiting [20:35:31] to reproduce [20:35:42] if I pass --over it is OK [20:35:46] is our concern really the anomaly from normal.. 
or the absolute number of jobs [20:36:10] it is normal for the number of jobs to spike [20:36:21] ok [20:36:23] but not if it is too much [20:36:35] then it is really an experiment, maybe it is easier to just set a fixed threshold [20:36:55] my concern is if we set it to for example 100 and we have a quick spike we get an alarm [20:37:12] or if we have 99 jobs waiting for half an hour the alarm will not kick off [20:39:30] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [20:39:47] yea.. so this is what happened with other checks using check_graphite, it seemed to be unreliable or hard to adjust to the right values [20:39:51] I am preparing a patch [20:40:00] ok [20:40:35] i'll go to the lunch break. will keep watching that and gerrit of course [20:42:07] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, and 2 others: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3030805 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labsdb1005.eqiad.wmnet'] ``` and were **ALL** successful. [20:42:10] (03PS1) 10Hashar: zuul: tweak Gearman queue alarm [puppet] - 10https://gerrit.wikimedia.org/r/337916 [20:42:20] mutante: enjoy the lunch and above patch would fix it up [20:42:32] I should have spent some extra time to actually test the probe. Sorry about that [20:45:50] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:47:00] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:51:16] (03PS6) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [20:52:25] (03CR) 10jerkins-bot: [V: 04-1] Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [20:55:03] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3030855 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1051.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1051.eqi... [20:58:55] (03PS1) 10Hashar: zuul: point queue alarm to proper Graph panel [puppet] - 10https://gerrit.wikimedia.org/r/337925 [20:58:57] (03PS1) 10Hashar: zuul: pass ensure to the queue graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/337926 [20:59:33] that should cover it :] [21:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170215T2100). [21:00:15] (03CR) 10Hashar: "The URL is a bit long:" [puppet] - 10https://gerrit.wikimedia.org/r/337916 (owner: 10Hashar) [21:00:24] Nothing for ORES today [21:01:41] 06Operations, 10Ops-Access-Requests, 10Icinga, 10Monitoring, 06Release-Engineering-Team: Rename Icinga contact 'amusso' to 'hashar' - https://phabricator.wikimedia.org/T158167#3030863 (10RobH) a:03RobH I don't mind making this change, I'll just restrict myself to doing so tomorrow in the AM. (This has... 
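Note on the 20:32 to 20:37 exchange about a fixed threshold versus anomaly detection: the trade-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of the real check_graphite plugin. The function names, the constant confidence bands, and the rule of comparing the count of out-of-band points against the -W 5 / -C 10 values are all hypothetical simplifications; real Holt-Winters bands vary per data point.

```python
def fixed_threshold_alert(samples, limit):
    """Alert as soon as any single sample exceeds a fixed limit."""
    return any(value > limit for value in samples)


def count_outside_bands(samples, lower, upper):
    """Count samples above the upper band and below the lower band."""
    above = sum(1 for value, hi in zip(samples, upper) if value > hi)
    below = sum(1 for value, lo in zip(samples, lower) if value < lo)
    return above, below


def anomaly_state(samples, lower, upper, warning=5, critical=10, over_only=False):
    """Hypothetical anomaly rule: compare the number of out-of-band points
    in the window against warning/critical counts. With over_only=True only
    points above the upper band are counted (an assumed reading of --over)."""
    above, below = count_outside_bands(samples, lower, upper)
    outside = above if over_only else above + below
    if outside >= critical:
        return "CRITICAL"
    if outside >= warning:
        return "WARNING"
    return "OK"


# Assumed window of 30 data points with constant bands, for simplicity.
window = 30
lower = [0.0] * window
upper = [40.0] * window

# One short spike: trips a fixed limit of 100, but is a single outlier.
quick_spike = [5.0] * window
quick_spike[10] = 150.0

# 99 jobs waiting for the whole window: never crosses a limit of 100.
sustained = [99.0] * window

# Mostly below the lower band, a few above (like "3 above and 12 below").
raised_lower = [10.0] * window
mostly_below = [0.0] * 12 + [50.0] * 3 + [20.0] * 15

print(fixed_threshold_alert(quick_spike, 100), anomaly_state(quick_spike, lower, upper))
print(fixed_threshold_alert(sustained, 100), anomaly_state(sustained, lower, upper))
print(anomaly_state(mostly_below, raised_lower, upper),
      anomaly_state(mostly_below, raised_lower, upper, over_only=True))
```

Under these assumptions the fixed limit fires on the one-point spike and stays silent on the sustained backlog, while the band-counting rule does the opposite. Counting only points above the band also silences the case dominated by low values, which is consistent with the observation that passing --over made the check report OK.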
[21:02:17] (03CR) 10jerkins-bot: [V: 04-1] zuul: pass ensure to the queue graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/337926 (owner: 10Hashar) [21:05:12] (03CR) 10Papaul: [C: 032] partman: delete raid1-lvm-ext4 recipe [puppet] - 10https://gerrit.wikimedia.org/r/337532 (https://phabricator.wikimedia.org/T156955) (owner: 10Dzahn) [21:08:03] (03PS2) 10Hashar: zuul: pass ensure to the queue graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/337926 [21:10:05] (03PS7) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [21:12:38] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 3 data above and 12 below the confidence bounds [21:13:58] (03CR) 10Nuria: Changes to perf consumer of event logging events (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [21:14:38] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 3 data above and 12 below the confidence bounds [21:15:38] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [21:15:38] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint2001 is OK: OK: No anomaly detected [21:15:58] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:24:27] (03CR) 10Chad: "Screenshot of before/after?" 
[puppet] - 10https://gerrit.wikimedia.org/r/337397 (owner: 10Ladsgroup) [21:43:37] (03PS1) 10Jcrespo: toolsdb: Increase innodb log file size to 500M (1 GB total) [puppet] - 10https://gerrit.wikimedia.org/r/337990 (https://phabricator.wikimedia.org/T157358) [21:47:23] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031005 (10Papaul) [21:50:24] (03CR) 10Dzahn: [C: 032] zuul: tweak Gearman queue alarm [puppet] - 10https://gerrit.wikimedia.org/r/337916 (owner: 10Hashar) [21:50:33] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031031 (10Papaul) [21:50:51] jouncebot next [21:50:51] In 2 hour(s) and 9 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T0000) [21:51:32] (03CR) 10Dzahn: [C: 032] zuul: point queue alarm to proper Graph panel [puppet] - 10https://gerrit.wikimedia.org/r/337925 (owner: 10Hashar) [21:51:53] (03PS6) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) [21:52:01] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031005 (10Papaul) [21:52:22] (03CR) 10Dzahn: [C: 032] zuul: pass ensure to the queue graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/337926 (owner: 10Hashar) [21:55:38] wikidata db read and write traffic just increased a 33% [21:56:01] more than that [21:56:55] Krinkle: https://gerrit.wikimedia.org/r/#/c/338005/1 [21:58:16] (03CR) 10Paladox: "I doint, notice a difference. 
I've deployed this on https://gerrit.git.wmflabs.org/" [puppet] - 10https://gerrit.wikimedia.org/r/337397 (owner: 10Ladsgroup) [21:58:34] (03PS7) 10Dzahn: Update mwdeploy group sudo rights for jessie [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [21:59:27] (03CR) 10Dzahn: [C: 032] "Giuseppe's concerns have been addressed. The intention is obvious and it's consistent with commands for Apache service." [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [22:00:21] schema plans discussion about to start in #wikimedia-office [22:00:34] brion i may listen in [22:00:39] cool [22:00:59] mutante: Danke :) will look at the zuul icinga anomaly tomorrow again [22:01:31] hashar: de rien. yep, will also keep an eye on it [22:01:38] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [22:02:38] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint2001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [22:04:49] :p [22:05:17] runs puppet on einsteinium [22:07:24] yep, that should be better now, and the one on 2001 was removed [22:08:00] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:08:08] (03PS1) 10Eevans: Revert "Enable Prometheus exporter on restbase1007 (canary)" [puppet] - 10https://gerrit.wikimedia.org/r/338010 (https://phabricator.wikimedia.org/T155120) [22:08:16] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3031118 (10Dzahn) [22:08:34] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2072646 (10Dzahn) labsdb1005 is gone. count: 6 [22:08:40] mutante: awesome :) [22:09:10] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman-panelId-20-fullscreen-var-check_window-30-from-now-30m-to-now on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 0 below the confidence bounds [22:09:49] and of course the URL ampersands are stripped, bah [22:10:07] ah. yea problem, dashboard not found [22:10:22] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=20&fullscreen&from=now-30m&to=now [22:10:40] so I guess just use the generic board :( https://grafana.wikimedia.org/dashboard/db/zuul-gearman [22:10:54] or URL shortener :) [22:11:07] can you use w.wiki?
heh [22:11:26] the idea was to pass the icinga check_window value as a URL parameter [22:11:31] to have them in sync [22:12:26] ok [22:14:09] ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman-panelId-20-fullscreen-var-check_window-30-from-now-30m-to-now on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 1 below the confidence bounds daniel_zahn just added - https://phabricator.wikimedia.org/T70113 [22:14:20] hashar here's an idea: do something similar to mediawiki where it assumes using regex or something and shows page stating did you mean (correct url with &) [22:14:51] i think the URL should go in "Status Information" column, not the service name [22:15:06] (03PS1) 10Yuvipanda: Revert "tools: Make DNS point to labsdb1004 and not 1005" [puppet] - 10https://gerrit.wikimedia.org/r/338012 (https://phabricator.wikimedia.org/T123731) [22:15:36] it's a bit ugly to have that long URL as the _name of the service_ [22:15:44] mutante: I will have to read the source tomorrow [22:15:50] and it also might not filter the ampersands if it's in info column [22:16:07] alright, yep [22:17:03] the check_window is not what I thought it was [22:17:17] I thought it was the Holt-Winters delta but it is hardcoded to 5 [22:17:24] oh..ah [22:17:29] so https://grafana-admin.wikimedia.org/dashboard/db/zuul-gearman?panelId=20&fullscreen&from=now-30m&to=now-5m [22:17:34] (I have updated the graph) [22:17:47] :) [22:17:58] so yeah some queue points are above in the last check_window=30 datapoints [22:18:05] gotta revisit but that will be for tomorrow [22:18:18] it might adjust until then..
*nod*, yes [22:18:25] over 6 hours [22:18:31] that shows it quickly detects spikes [22:18:39] so that would probably work [22:20:13] (03PS1) 10Hashar: Revert "zuul: point queue alarm to proper Graph panel" [puppet] - 10https://gerrit.wikimedia.org/r/338014 [22:20:48] that will clean up the reported URL, reverting to pointing to the dashboard [22:20:50] that is good enough [22:21:06] yes [22:21:28] (03PS2) 10Dzahn: Revert "zuul: point queue alarm to proper Graph panel" [puppet] - 10https://gerrit.wikimedia.org/r/338014 (owner: 10Hashar) [22:21:31] based on https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=20&fullscreen [22:21:36] it will alarm a bit too much probably [22:21:51] hard to tell because the graph does not show when there are 30 data points above the upper confidence band [22:21:58] maybe that is doable in Grafana, will have to check [22:22:14] so monitoring::graphite_anomaly calls it "description" but it's actually the service name [22:22:20] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031214 (10RobH) [22:22:22] looking at that [22:22:42] 06Operations, 06WMF-Legal, 10Wikimedia-General-or-Unknown, 07Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#711581 (10mmodell) Bump. I think that we should either 1. merge @chasemp's patch, or 2. amend it to say Apache 2.0 and merge that Ei... [22:23:20] PROBLEM - puppet last run on mc1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:24:02] 06Operations, 06WMF-Legal, 10Wikimedia-General-or-Unknown, 07Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#3031236 (10Paladox) [22:24:20] RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:25:17] mutante: I am sleeping now.
Thanks again for all the patches [22:25:28] hashar: no problem. good night, cu tomorrow [22:34:10] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman-panelId-20-fullscreen-var-check_window-30-from-now-30m-to-now on contint1001 is OK: OK: No anomaly detected [22:34:34] (03CR) 10Dzahn: [C: 032] Revert "zuul: point queue alarm to proper Graph panel" [puppet] - 10https://gerrit.wikimedia.org/r/338014 (owner: 10Hashar) [22:34:37] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031361 (10RobH) @Papaul: The task description currently has a section for db70 reading: db2070 [] - setup new port configuration ge-6/0/18 [] - remove old port configuration ge-5/0/ However, I cur... [22:37:00] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [22:37:58] (03PS3) 10Dzahn: partman: delete raid1-lvm-ext4 recipe [puppet] - 10https://gerrit.wikimedia.org/r/337532 (https://phabricator.wikimedia.org/T156955) [22:38:43] (03CR) 10Dzahn: "dunno.. maybe even if it's not used right now, we want it again in the future? hmmmm" [puppet] - 10https://gerrit.wikimedia.org/r/337532 (https://phabricator.wikimedia.org/T156955) (owner: 10Dzahn) [22:39:24] robh: ^ i wonder ... unused partman recipe ... [22:40:36] paladox: see update on https://gerrit.wikimedia.org/r/#/c/333358/ [22:40:56] mutante thanks. 
[22:41:04] (03Abandoned) 10MaxSem: Enable mobile redirection for all wikimanias [puppet] - 10https://gerrit.wikimedia.org/r/337767 (owner: 10MaxSem) [22:41:17] (03CR) 10MaxSem: [C: 031] adjust wikimania regex for mobile hosts, cover 2002-2019 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn) [22:48:08] (03PS2) 10Yuvipanda: Revert "tools: Make DNS point to labsdb1004 and not 1005" [puppet] - 10https://gerrit.wikimedia.org/r/338012 (https://phabricator.wikimedia.org/T123731) [22:48:18] (03CR) 10Yuvipanda: [V: 032 C: 032] Revert "tools: Make DNS point to labsdb1004 and not 1005" [puppet] - 10https://gerrit.wikimedia.org/r/338012 (https://phabricator.wikimedia.org/T123731) (owner: 10Yuvipanda) [22:50:01] (03Abandoned) 10Jcrespo: toolsdb: Increase innodb log file size to 500M (1 GB total) [puppet] - 10https://gerrit.wikimedia.org/r/337990 (https://phabricator.wikimedia.org/T157358) (owner: 10Jcrespo) [22:51:03] (03CR) 10Dzahn: "adding Muehlenhoff" [puppet] - 10https://gerrit.wikimedia.org/r/333024 (owner: 10Addshore) [22:52:20] (03CR) 10Dzahn: [C: 031] "+1 assuming it's true that there is no NDA required for grafana which i don't know for sure." 
[puppet] - 10https://gerrit.wikimedia.org/r/333024 (owner: 10Addshore) [22:52:49] !log demon@tin Synchronized php-1.29.0-wmf.12/extensions/Dashiki: prep-type stuff (duration: 00m 50s) [22:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:29] !log thcipriani@tin Synchronized php-1.29.0-wmf.12/includes/libs/rdbms/ChronologyProtector.php: [[gerrit:338016|Add version to ChronologyProtector key]] T158217 (duration: 00m 41s) [22:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:34] T158217: Fatal error: Call to undefined method __PHP_Incomplete_Class::asOfTime() in /srv/mediawiki/php-1.29.0-wmf.11/includes/libs/rdbms/ChronologyProtector.php on line 327 - https://phabricator.wikimedia.org/T158217 [22:56:23] (03CR) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333358 (owner: 10Paladox) [22:56:32] (03CR) 10Dzahn: "adding paravoid" [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) (owner: 10Hashar) [22:57:55] (03CR) 10Dzahn: "@paladox let's abandon and start with multiple smaller patches" [puppet] - 10https://gerrit.wikimedia.org/r/333358 (owner: 10Paladox) [22:58:10] (03Abandoned) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 (owner: 10Paladox) [22:58:19] (03CR) 10Dzahn: [C: 031] "needs maintenance window. together with LDAP change" [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [23:00:31] (03Abandoned) 10Dzahn: (WIP) services: create global service restart script [puppet] - 10https://gerrit.wikimedia.org/r/325039 (owner: 10Dzahn) [23:02:56] how to properly check that https://gerrit.wikimedia.org/r/#/c/292785/3 is a no-op ? 
[23:03:13] gotta test _every single redirect_ [23:09:57] /12/8 [23:18:06] jouncebot: next [23:18:06] In 0 hour(s) and 41 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T0000) [23:18:15] too soon [23:20:24] (03Draft1) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [23:20:29] (03PS2) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [23:23:51] (03CR) 10Paladox: "Tested locally and it works :)" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [23:24:40] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, and 2 others: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3031456 (10jcrespo) This is mostly done, no major incidents- servers where only in read-only for a few seconds before and after the maintenance, for switc... [23:28:57] (03PS3) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [23:33:34] (03PS4) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [23:33:49] (03CR) 10Paladox: "Working example https://secure.phabricator.com/T4181#204679" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [23:35:15] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:37:32] 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031499 (10Papaul) [23:47:45] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:53:55] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [23:54:55] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
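Note on the 23:02 question about proving a redirect change is a no-op: one way to "test every single redirect" offline is to resolve the same list of source URLs under the old and the new rule set and report any difference. The Python sketch below assumes the change is a refactoring of ordered (pattern, replacement) rules; the log does not say what the patch actually contains. The rule format, the example.org hostnames, and the function names are hypothetical.

```python
import re


def resolve(rules, url):
    """Return the redirect target for url under an ordered list of
    (regex pattern, replacement template) rules, or None if nothing matches."""
    for pattern, replacement in rules:
        match = re.fullmatch(pattern, url)
        if match:
            return match.expand(replacement)
    return None


def diff_redirects(old_rules, new_rules, urls):
    """List (url, old target, new target) for every URL whose result changes."""
    differences = []
    for url in urls:
        before = resolve(old_rules, url)
        after = resolve(new_rules, url)
        if before != after:
            differences.append((url, before, after))
    return differences


# Hypothetical "before" state: one explicit rule per redirect.
old_rules = [
    (r"http://old\.example\.org/a", "https://new.example.org/a"),
    (r"http://old\.example\.org/b", "https://new.example.org/b"),
]

# Hypothetical refactoring that should behave identically.
new_rules = [
    (r"http://old\.example\.org/(a|b)", r"https://new.example.org/\1"),
]

# A refactoring that is too broad and starts redirecting an extra path.
too_broad_rules = [
    (r"http://old\.example\.org/(.*)", r"https://new.example.org/\1"),
]

urls = [
    "http://old.example.org/a",
    "http://old.example.org/b",
    "http://old.example.org/c",  # not redirected by the old rules
]

unchanged = diff_redirects(old_rules, new_rules, urls)
changed = diff_redirects(old_rules, too_broad_rules, urls)
print(unchanged)
print(changed)
```

The list of test URLs should include paths that are not supposed to redirect, since a rule that becomes too broad only shows up on those.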