[00:15:08] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 38.57 seconds
[00:37:09] (PS1) Bearloga: statistics::discovery: Manage datasets dir [puppet] - https://gerrit.wikimedia.org/r/371769 (https://phabricator.wikimedia.org/T170494)
[00:58:19] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 350.43 seconds
[01:03:18] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 45.01 seconds
[02:39:31] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.11) (duration: 08m 29s)
[02:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:59] (CR) Legoktm: "What is this limit for exactly? Earlier in the hackathon addshore hit this as well - except he had 12 patches, only one away from 11." [puppet] - https://gerrit.wikimedia.org/r/371739 (owner: Greg Grossmeier)
[02:49:09] Operations, DBA, Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3523208 (Marostegui) p:Unbreak!>Normal This graphs shows that the traffic got back to normal rate hours ago. What caused this...I don't know. But I think we can lower (if not close) this is...
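The db1047 and dbstore1002 replication-lag alerts in this log all use the same fixed message shape. The following is a minimal sketch of how to extract the section, host, state and lag from one of these lines. It assumes only the format visible in this log, not any documented Icinga output contract, and `LagAlert` and `parse_lag_alert` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived only from the alert lines in this log, e.g.
# "PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 350.43 seconds"
LAG_ALERT_RE = re.compile(
    r"(?P<kind>PROBLEM|RECOVERY) - MariaDB Slave Lag: (?P<section>\S+) on (?P<host>\S+) "
    r"is (?P<state>[A-Z]+): .*?Replication lag: (?P<lag>[\d.]+) seconds"
)


class LagAlert(NamedTuple):
    kind: str          # PROBLEM or RECOVERY
    section: str       # replication section, e.g. s1 or s2
    host: str          # database host, e.g. db1047
    state: str         # Icinga state, e.g. OK or CRITICAL
    lag_seconds: float


def parse_lag_alert(line: str) -> Optional[LagAlert]:
    """Return the parsed alert, or None if the line is not a lag alert."""
    m = LAG_ALERT_RE.search(line)
    if m is None:
        return None
    return LagAlert(m["kind"], m["section"], m["host"], m["state"], float(m["lag"]))
```

For example, the 00:58:19 line parses to `LagAlert(kind='PROBLEM', section='s2', host='db1047', state='CRITICAL', lag_seconds=350.43)`.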
[03:08:25] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.13) (duration: 06m 47s)
[03:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Aug 14 03:15:31 UTC 2017 (duration 7m 6s)
[03:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:19] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 821.91 seconds
[04:32:38] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 297.00 seconds
[05:56:13] (CR) Giuseppe Lavagetto: systemd::service: convert a bunch of modules to it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/371481 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[06:49:33] (PS4) Giuseppe Lavagetto: systemd::service: convert a bunch of modules to it [puppet] - https://gerrit.wikimedia.org/r/371481 (https://phabricator.wikimedia.org/T173078)
[06:51:08] (CR) Giuseppe Lavagetto: [C: 2] systemd::service: convert a bunch of modules to it [puppet] - https://gerrit.wikimedia.org/r/371481 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[06:55:15] (Abandoned) Elukey: Update changelog for version 0.0.11-2~jessie [debs/logster] - https://gerrit.wikimedia.org/r/367383 (https://phabricator.wikimedia.org/T171318) (owner: Elukey)
[07:04:29] PROBLEM - Disk space on labvirt1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:29] PROBLEM - puppet last run on labvirt1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:38] PROBLEM - configured eth on labvirt1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:48] PROBLEM - kvm ssl cert on labvirt1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:58] PROBLEM - dhclient process on labvirt1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:05:09] PROBLEM - DPKG on labvirt1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:05:09] PROBLEM - salt-minion processes on labvirt1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:07:18] PROBLEM - nova-compute process on labvirt1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:49] RECOVERY - dhclient process on labvirt1018 is OK: PROCS OK: 0 processes with command name dhclient
[07:09:08] RECOVERY - DPKG on labvirt1018 is OK: All packages OK
[07:09:08] RECOVERY - salt-minion processes on labvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:09:08] RECOVERY - nova-compute process on labvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[07:09:28] RECOVERY - Disk space on labvirt1018 is OK: DISK OK
[07:09:28] RECOVERY - puppet last run on labvirt1018 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[07:09:29] RECOVERY - configured eth on labvirt1018 is OK: OK - interfaces up
[07:09:38] RECOVERY - kvm ssl cert on labvirt1018 is OK: Cert /etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt will not expire for at least 30 days.
[07:45:57] (PS3) Giuseppe Lavagetto: prometheus: convert to systemd::service where needed [puppet] - https://gerrit.wikimedia.org/r/371482 (https://phabricator.wikimedia.org/T173078)
[07:49:00] (CR) Giuseppe Lavagetto: [C: 2] prometheus: convert to systemd::service where needed [puppet] - https://gerrit.wikimedia.org/r/371482 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[07:52:49] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:53:38] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:53:40] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:53:59] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:54:08] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:55:18] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:55:38] <_joe_> ouch that's me
[07:55:38] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:55:39] <_joe_> fixing
[07:55:48] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:56:47] (PS1) Giuseppe Lavagetto: prometheus::varnish_exporter: fix typo [puppet] - https://gerrit.wikimedia.org/r/371922
[07:56:48] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:56:48] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:02] (PS12) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[07:57:18] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:21] (CR) Giuseppe Lavagetto: [C: 2] prometheus::varnish_exporter: fix typo [puppet] - https://gerrit.wikimedia.org/r/371922 (owner: Giuseppe Lavagetto)
[07:57:38] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:39] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:39] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:48] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:58] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:58:08] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:58:29] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:58:48] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:58:48] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:58:58] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:59:18] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:59:28] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:59:48] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:59:49] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:00:29] <_joe_> the issue is now fixed
[08:00:38] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:00:41] <_joe_> let me try to break the varnish puppet code yet again.
[08:04:22] (CR) Alexandros Kosiaris: "Submit a followup patch ? :P" [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355082 (owner: Giuseppe Lavagetto)
[08:06:40] (PS13) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[08:08:36] (CR) Giuseppe Lavagetto: "> This is a good effort, but I feel this needs more work. Since this" [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355082 (owner: Giuseppe Lavagetto)
[08:19:28] (PS2) Giuseppe Lavagetto: varnish: convert to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371617 (https://phabricator.wikimedia.org/T173078)
[08:20:01] (CR) Gehel: [C: -1] statistics::discovery: Manage datasets dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/371769 (https://phabricator.wikimedia.org/T170494) (owner: Bearloga)
[08:20:48] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:21:49] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[08:22:08] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:22:18] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[08:23:28] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[08:23:48] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[08:23:59] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:24:39] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[08:24:42] Operations, ops-eqiad, DC-Ops, Discovery-Search (Current work): some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816#3523403 (Gehel) Checking a few servers on grafana, it looks like temperature is still down. This can be closed.
[08:24:59] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[08:25:28] (PS14) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[08:26:41] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler03/7425/cp1052.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/371617 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[08:27:16] <_joe_> win 25
[08:35:18] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:35:19] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:37:38] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:37:58] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:37:58] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:38:29] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:39:08] (CR) Elukey: "New pcc: https://puppet-compiler.wmflabs.org/compiler02/7426/" [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[08:43:08] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:45:23] (PS2) Giuseppe Lavagetto: thumbor,swift: convert to systemd::service and systemd::unit [puppet] - https://gerrit.wikimedia.org/r/371618 (https://phabricator.wikimedia.org/T173078)
[08:54:35] Operations, Phabricator: Only allow Phabricator weekly project changes on prod - https://phabricator.wikimedia.org/T173297#3523424 (Paladox)
[08:55:16] Operations, Phabricator: Only allow Phabricator weekly project changes on prod - https://phabricator.wikimedia.org/T173297#3523436 (Paladox) p:Triage>High Setting high as I don't want to be spamming, especially with people presuming aklapper is spamming, which would not be true.
[08:55:59] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:56:24] Operations, Puppet, Patch-For-Review, User-Joe: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704#3523439 (Gehel)
[08:57:07] (PS10) Gehel: wdqs - remove upstart configuration files [puppet] - https://gerrit.wikimedia.org/r/369688 (https://phabricator.wikimedia.org/T171704)
[08:57:18] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[08:57:18] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:57:38] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[08:57:48] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[08:57:55] (PS5) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[08:57:57] (PS1) Jcrespo: Install mydumper on dbstore_multiinstance hosts, drop tls [puppet] - https://gerrit.wikimedia.org/r/371925 (https://phabricator.wikimedia.org/T169516)
[08:58:08] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[08:58:18] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:58:28] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[08:58:42] (PS2) Jcrespo: mariadb: Install mydumper on dbstore_multiinstance hosts, drop tls [puppet] - https://gerrit.wikimedia.org/r/371925 (https://phabricator.wikimedia.org/T169516)
[08:59:27] (PS3) Jcrespo: mariadb: Install mydumper on dbstore_multiinstance hosts, drop tls [puppet] - https://gerrit.wikimedia.org/r/371925 (https://phabricator.wikimedia.org/T169516)
[09:00:02] (CR) Jcrespo: [C: 2] mariadb: Install mydumper on dbstore_multiinstance hosts, drop tls [puppet] - https://gerrit.wikimedia.org/r/371925 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[09:01:53] (PS1) MarcoAurelio: Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[09:02:12] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7427" [puppet] - https://gerrit.wikimedia.org/r/371618 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[09:02:23] (PS3) Giuseppe Lavagetto: thumbor,swift: convert to systemd::service and systemd::unit [puppet] - https://gerrit.wikimedia.org/r/371618 (https://phabricator.wikimedia.org/T173078)
[09:02:28] Operations, Phabricator: Only allow Phabricator weekly project changes on prod - https://phabricator.wikimedia.org/T173297#3523449 (Paladox)
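A single typo caused the burst of "Catalog fetch fail" alerts on the cp* hosts. After the fix was merged, the hosts cleared one by one over the following hour. The following is a minimal sketch of how to measure how long each host stayed in PROBLEM for a given check. It assumes only the line shape of this log and that all lines fall on the same day, with no midnight wrap-around. `outage_durations` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

# Matches icinga-wm lines in this log, e.g.
# "[07:52:49] PROBLEM - puppet last run on cp1072 is CRITICAL: ..."
ALERT_RE = re.compile(
    r"\[(?P<ts>\d{2}:\d{2}:\d{2})\] (?P<kind>PROBLEM|RECOVERY) - "
    r"(?P<check>.+?) on (?P<host>\S+) is "
)


def outage_durations(lines, check):
    """Map host -> time from its first PROBLEM to the following RECOVERY for one check.

    Assumes all lines fall on the same day (no midnight wrap-around handling).
    Hosts still in PROBLEM at the end of the input are left out, and only the
    most recent closed outage per host is kept.
    """
    open_since = {}   # host -> timestamp of the PROBLEM that opened the outage
    durations = {}    # host -> timedelta of the closed outage
    for line in lines:
        m = ALERT_RE.match(line)
        if m is None or m["check"] != check:
            continue  # chat lines, other bots, other checks
        ts = datetime.strptime(m["ts"], "%H:%M:%S")
        host = m["host"]
        if m["kind"] == "PROBLEM":
            open_since.setdefault(host, ts)
        elif host in open_since:
            durations[host] = ts - open_since.pop(host)
    return durations
```

Applied to the "puppet last run" lines above, this gives, for example, 3 minutes for cp3048 (07:57:38 to 08:00:38) and 29m19s for cp1072 (07:52:49 to 08:22:08).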
[09:03:24] (CR) jerkins-bot: [V: -1] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[09:05:01] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor2001.codfw.wmnet
[09:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:25] (CR) Gehel: "I'm renaming the "r" module to "r_lang" which is probably a bit less confusing (https://gerrit.wikimedia.org/r/#/c/371075/). This will con" [puppet] - https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: Hashar)
[09:06:20] (Draft1) Paladox: Phabricator: Only send logmail on prod not labs [puppet] - https://gerrit.wikimedia.org/r/371927
[09:06:22] (PS2) Paladox: Phabricator: Only send logmail on prod not labs [puppet] - https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297)
[09:06:35] (PS1) Elukey: geowiki: delay cronjobs to reduce cronspam [puppet] - https://gerrit.wikimedia.org/r/371928
[09:06:59] (CR) Paladox: Phabricator: Only send logmail on prod not labs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297) (owner: Paladox)
[09:07:25] <_joe_> !log stopping thumbor on thumbor2001 after depooling it for testing
[09:07:29] (CR) Elukey: [C: 2] geowiki: delay cronjobs to reduce cronspam [puppet] - https://gerrit.wikimedia.org/r/371928 (owner: Elukey)
[09:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:27] _joe_ whenever you are ready you can merge my changes
[09:08:40] <_joe_> elukey: ok, one sec
[09:08:45] (PS3) Paladox: Phabricator: Only send logmail on prod not labs [puppet] - https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297)
[09:08:59] (PS2) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[09:09:14] Hi, does anyone know why I can't add dzahn to ^^, please?
[09:09:18] I am getting this error
[09:09:19] "Change not visible to dzahn"
[09:10:30] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[09:10:45] (PS3) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[09:11:19] paladox: to where?
[09:11:24] On gerrit
[09:11:28] puppet changes
[09:11:36] https://gerrit.wikimedia.org/r/371927 ?
[09:11:43] yeh
[09:11:47] Operations, ops-eqiad, Analytics-Kanban, User-Elukey: Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s - https://phabricator.wikimedia.org/T172633#3523462 (elukey) I got fooled by dmesg, eth0 was operating at 100 Mbps and it was not flapping. After re-negotiation I can see this:...
[09:12:21] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[09:12:24] does not appear as dz.* in the dropdown of users I can select
[09:12:29] yep
[09:12:33] is he using another gerrit name?
[09:12:48] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:12:50] (PS1) Giuseppe Lavagetto: thumbor: fix dependency [puppet] - https://gerrit.wikimedia.org/r/371929
[09:13:14] Nope
[09:13:28] PROBLEM - puppet last run on thumbor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:13:55] (CR) Giuseppe Lavagetto: [C: 2] thumbor: fix dependency [puppet] - https://gerrit.wikimedia.org/r/371929 (owner: Giuseppe Lavagetto)
[09:13:56] weird
[09:15:28] RECOVERY - puppet last run on thumbor2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[09:15:48] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational
[09:16:34] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor2001.codfw.wmnet
[09:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:30] (PS2) Giuseppe Lavagetto: base::service_unit: convert services to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371619 (https://phabricator.wikimedia.org/T173078)
[09:22:28] (CR) Giuseppe Lavagetto: [C: 2] base::service_unit: convert services to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371619 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[09:26:18] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:26:40] <_joe_> that's me ^^
[09:27:19] (PS4) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[09:27:56] Operations, hardware-requests: refresh hardware for logstash100[123] - https://phabricator.wikimedia.org/T173298#3523477 (Gehel)
[09:28:49] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[09:30:15] (PS5) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[09:30:38] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:30:39] (PS1) Giuseppe Lavagetto: eventlogging::service: fix dependency [puppet] - https://gerrit.wikimedia.org/r/371930
[09:31:17] (CR) Giuseppe Lavagetto: [C: 2] eventlogging::service: fix dependency [puppet] - https://gerrit.wikimedia.org/r/371930 (owner: Giuseppe Lavagetto)
[09:31:46] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[09:31:48] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:33:19] (PS6) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[09:33:57] (CR) MarcoAurelio: "Apparently it is failing because the -computed file must match the securepoll.dblist contents. I cannot run the expanddblist command." [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[09:34:45] (PS1) Giuseppe Lavagetto: eventlogging: further fix [puppet] - https://gerrit.wikimedia.org/r/371931
[09:34:48] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[09:35:19] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:35:32] (CR) Giuseppe Lavagetto: [C: 2] eventlogging: further fix [puppet] - https://gerrit.wikimedia.org/r/371931 (owner: Giuseppe Lavagetto)
[09:36:46] (CR) MarcoAurelio: "Adding Matt as he created the flow-computed.dblist and maybe he'd like to help me here." [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[09:37:19] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[09:37:30] (PS2) MarcoAurelio: Set project logo for wikimania2018wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042)
[09:39:02] (PS5) Gehel: rename "r" module to "r_lang" [puppet] - https://gerrit.wikimedia.org/r/371075
[09:40:21] (CR) Gehel: [C: 2] rename "r" module to "r_lang" [puppet] - https://gerrit.wikimedia.org/r/371075 (owner: Gehel)
[09:44:28] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R]
[09:44:47] ^ that's me, fix coming up
[09:45:21] (PS1) Gehel: r_lang : fix path to update-library.R [puppet] - https://gerrit.wikimedia.org/r/371932
[09:45:53] (CR) Gehel: [C: 2] r_lang : fix path to update-library.R [puppet] - https://gerrit.wikimedia.org/r/371932 (owner: Gehel)
[09:48:29] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[09:50:45] Operations, ops-eqiad, Analytics-Kanban, User-Elukey: Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s - https://phabricator.wikimedia.org/T172633#3523524 (elukey) Stopped again analytics1034, @Cmjohnson I think that the cable swap didn't work :(
[09:57:59] (PS7) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[09:58:48] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[09:59:26] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[09:59:34] * TabbyCat sighs
[10:00:58] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[10:01:57] (PS8) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[10:03:27] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[10:03:38] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:19:28] Operations, Phabricator, Patch-For-Review: Only allow Phabricator weekly project changes on prod - https://phabricator.wikimedia.org/T173297#3523545 (Aklapper) p:High>Low Setting low priority as this never has been a big issue. Priority = urgency. This is not urgent. Hence please do not set h...
[10:19:46] Operations, Phabricator, Patch-For-Review: Only allow Phabricator weekly project changes cron job on production, not labs - https://phabricator.wikimedia.org/T173297#3523547 (Aklapper)
[10:24:17] (CR) Alexandros Kosiaris: [C: 1] Add LVS nonzero ranges in network::subnets [puppet] - https://gerrit.wikimedia.org/r/370210 (https://phabricator.wikimedia.org/T170518) (owner: BBlack)
[10:27:09] !log Restart MySQL on db1095 to pick up new replication filters
[10:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:49] !log Restart MySQL on db1069 to pick up new replication filters
[10:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:02] (PS9) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[10:46:58] Does logstash have an API?
[10:47:29] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[10:54:45] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db2076 [mediawiki-config] - https://gerrit.wikimedia.org/r/371934 (https://phabricator.wikimedia.org/T151029)
[10:57:01] Operations, Analytics-Kanban, Patch-For-Review, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3523625 (elukey) @Ottomata @Pchelolo Turns out I am stupid, Event Streams might not be the culprit. I remembered this...
[10:57:21] ebernhardson: Do you know if logstash has an API?
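The analytics1034 task concerns a port that negotiated 100Mb/s instead of 1000Mb/s. On Linux, the negotiated speed appears in `ethtool <iface>` output, and sysfs also exposes it in Mb/s as `/sys/class/net/<iface>/speed`. The following is a minimal sketch of a check built on that sysfs file. The 1000 Mb/s expectation comes from the task title, and the function names are hypothetical.

```python
from pathlib import Path

SYSFS_NET = Path("/sys/class/net")  # standard Linux location of per-interface attributes


def parse_speed(raw: str) -> int:
    """Parse the contents of /sys/class/net/<iface>/speed (value in Mb/s)."""
    return int(raw.strip())


def read_link_speed(iface: str, root: Path = SYSFS_NET) -> int:
    # Reading this file may raise OSError, or yield -1, when the link is down,
    # depending on the kernel and NIC driver.
    return parse_speed((root / iface / "speed").read_text())


def is_degraded(speed_mbps: int, expected_mbps: int = 1000) -> bool:
    """True when the negotiated speed is below what the port should run at."""
    return speed_mbps < expected_mbps
```

On a host, `is_degraded(read_link_speed("eth0"))` would have returned True for the 100Mb/s link described in the task. Keeping parsing separate from file access lets you test the logic without real hardware.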
[10:57:29] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db2076 [mediawiki-config] - https://gerrit.wikimedia.org/r/371934 (https://phabricator.wikimedia.org/T151029) (owner: Marostegui)
[10:58:56] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db2076 [mediawiki-config] - https://gerrit.wikimedia.org/r/371934 (https://phabricator.wikimedia.org/T151029) (owner: Marostegui)
[10:59:06] (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db2076 [mediawiki-config] - https://gerrit.wikimedia.org/r/371934 (https://phabricator.wikimedia.org/T151029) (owner: Marostegui)
[11:00:35] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2076 - T170662 T151029 (duration: 01m 02s)
[11:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:47] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662
[11:00:47] T151029: duplicate key problems - https://phabricator.wikimedia.org/T151029
[11:01:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db2076 - T170662 T151029 (duration: 00m 47s)
[11:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:13] (PS10) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[11:16:39] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[11:21:12] (PS6) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[11:21:14] (PS1) Jcrespo: mariadb: Disable buffer pool loading and dumping on new dbstores [puppet] - https://gerrit.wikimedia.org/r/371935 (https://phabricator.wikimedia.org/T169516)
[11:27:10] (CR) Jcrespo: [C: 2] mariadb: Disable buffer pool loading and dumping on new dbstores [puppet] - https://gerrit.wikimedia.org/r/371935 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[11:27:18] (PS2) Jcrespo: mariadb: Disable buffer pool loading and dumping on new dbstores [puppet] - https://gerrit.wikimedia.org/r/371935 (https://phabricator.wikimedia.org/T169516)
[11:32:08] (PS1) Elukey: pmacct: add the possibility to configure librdkafka [puppet] - https://gerrit.wikimedia.org/r/371936 (https://phabricator.wikimedia.org/T172681)
[11:37:17] (PS2) Elukey: pmacct: add the possibility to configure librdkafka [puppet] - https://gerrit.wikimedia.org/r/371936 (https://phabricator.wikimedia.org/T172681)
[11:49:43] (PS1) Elukey: role::pmacct: explicitly set librkafka version [puppet] - https://gerrit.wikimedia.org/r/371937 (https://phabricator.wikimedia.org/T172681)
[11:52:48] Operations, Commons, MediaWiki-extensions-Scribunto, Patch-For-Review, Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 when accessed from non-English languages specified in the template - https://phabricator.wikimedia.org/T171392#3523703 (zhuyife...
[11:54:12] Operations, Commons, MediaWiki-extensions-Scribunto, Patch-For-Review, Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 when accessed from non-English languages specified in the template - https://phabricator.wikimedia.org/T171392#3523704 (zhuyife...
[12:03:12] (PS3) Elukey: pmacct: add the possibility to configure librdkafka [puppet] - https://gerrit.wikimedia.org/r/371936 (https://phabricator.wikimedia.org/T172681)
[12:03:14] (PS2) Elukey: role::pmacct: explicitly set librkafka version [puppet] - https://gerrit.wikimedia.org/r/371937 (https://phabricator.wikimedia.org/T172681)
[12:21:22] !log stopping replication on all instances of dbstore2001 T169516
[12:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:36] T169516: Implement cron-based mydumper backups on the dbstore role - https://phabricator.wikimedia.org/T169516
[12:41:45] (CR) Elukey: [C: 2] pmacct: add the possibility to configure librdkafka [puppet] - https://gerrit.wikimedia.org/r/371936 (https://phabricator.wikimedia.org/T172681) (owner: Elukey)
[12:50:15] (CR) Alexandros Kosiaris: [C: -1] "PS8 effectively hijacks this change to add stuff that are not in the scope of this change as stated in the change's commit message. Let's " [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[12:53:28] (CR) Alexandros Kosiaris: [C: -1] "This would not be sufficient. There's a global service timeout on icinga.cfg that's set to 60 secs as well. We would also have to bump tha" [puppet] - https://gerrit.wikimedia.org/r/370858 (https://phabricator.wikimedia.org/T172921) (owner: Herron)
[12:53:44] (PS3) Elukey: role::pmacct: explicitly set librkafka version [puppet] - https://gerrit.wikimedia.org/r/371937 (https://phabricator.wikimedia.org/T172681)
[12:56:29] (PS4) Elukey: role::pmacct: explicitly set librkafka version [puppet] - https://gerrit.wikimedia.org/r/371937 (https://phabricator.wikimedia.org/T172681)
[12:57:03] (CR) Elukey: [C: 2] role::pmacct: explicitly set librkafka version [puppet] - https://gerrit.wikimedia.org/r/371937 (https://phabricator.wikimedia.org/T172681) (owner: Elukey)
[12:59:47] (PS2) Giuseppe Lavagetto: base::service_unit: move template rendering to the caller [puppet] - https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704)
[13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170814T1300). Please do the needful.
[13:02:54] Aug 14 12:58:13 rhenium nfacctd[17848]: WARN: [/etc/pmacct/nfacctd.conf:28] Unknown key: kafka_config_file. Ignored.
[13:02:57] sigh
[13:03:48] (CR) Alexandros Kosiaris: [C: -1] Adds hieradata for ores::celery::workers with default. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/369915 (https://phabricator.wikimedia.org/T169246) (owner: Halfak)
[13:05:03] Operations, ORES, Scoring-platform-team, Patch-For-Review, User-Joe: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3523808 (akosiaris) I 've commented on the above change. @Halfak I am around again so let's schedule some time to look into this again. There...
[13:08:21] !log nothing for EU SWAT
[13:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:15] of course I didn't check the release, stretch has 1.6.1 meanwhile what I need is 1.6.2
[13:10:18] https://github.com/pmacct/pmacct/blob/master/ChangeLog
[13:10:19] * elukey cries
[13:12:19] (CR) Elukey: "This didn't work, I missed to check what version supports the kafka_config_file parameter and only 1.6.2 does, meanwhile on stretch we hav" [puppet] - https://gerrit.wikimedia.org/r/371937 (https://phabricator.wikimedia.org/T172681) (owner: Elukey)
[13:16:07] Operations, ORES, Scoring-platform-team-Backlog: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3523811 (akosiaris) >>! In T171851#3478511, @Halfak wrote: > I just talked to @fgiunchedi about this in IRC and he said that reimaging while they aren't getting traffic...
[13:19:56] Operations, Analytics-Kanban, Patch-For-Review, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3523813 (elukey) This time the root cause seems found: ``` 2017-07-28 14:11 upgrading rhenium to stretch vi...
[13:24:06] elukey: in meeting today, the pmacct is totally experimental and you can stop it if it's causing any kind of issues
[13:24:11] s/the/but/
[13:25:22] paravoid: hi! Ok ok lemme try to stop it to see if it fixes the issue.. the puppet work is all done though, I missed to check the version :(
[13:28:30] paravoid: yes it seems fixing the issue.. shall I just systemctl mask nfacctd ? Would it be ok?
[13:29:14] (PS11) Urbanecm: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[13:30:46] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[13:32:31] !log Execute systemctl mask nfacctd on rhenium.wikimedia.org for T172681
[13:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:43] T172681: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681
[13:32:56] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 33 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[nfacctd]
[13:33:33] Operations, Analytics-Kanban, Patch-For-Review, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3523854 (elukey) nfacctd stopped, immediate recovery on the brokers logs (no more exceptions logged). Let's wait a bi...
[13:33:46] elukey: yeah
[13:33:47] (CR) Urbanecm: "@MarcoAurelio I run the expanddblist binary and uploaded as new patch." [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[13:39:08] (PS12) Urbanecm: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[13:41:55] (PS1) Gehel: wdqs - send logs to logstash [puppet] - https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710)
[13:43:07] (CR) Urbanecm: "@MarcoAurelio: Fixed, jenkins gave +2." [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[13:43:12] (PS1) Giuseppe Lavagetto: Convert base::service_unit to the new structure [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/371940
[13:47:38] (CR) Herron: "Ah, thanks! Will update to include service_check_timeout. Agreed about >60s though we do already have hp raid checks deployed with a 90s" [puppet] - https://gerrit.wikimedia.org/r/370858 (https://phabricator.wikimedia.org/T172921) (owner: Herron)
[13:47:56] (PS2) Herron: Add 90s command_timeout override to nrpe_local.cfg [puppet] - https://gerrit.wikimedia.org/r/370858 (https://phabricator.wikimedia.org/T172921)
[13:49:15] (PS3) Herron: Add 90s command_timeout override to nrpe_local.cfg [puppet] - https://gerrit.wikimedia.org/r/370858 (https://phabricator.wikimedia.org/T172921)
[13:51:25] Operations, ORES, Scoring-platform-team-Backlog: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3523885 (Halfak) \o/ Sounds good.
[14:06:09] (PS2) Giuseppe Lavagetto: Convert base::service_unit to the new structure [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/371940
[14:06:44] (CR) Giuseppe Lavagetto: [C: 2] "This will be reverted in case we abort the main change." [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/371940 (owner: Giuseppe Lavagetto)
[14:06:59] (Merged) jenkins-bot: Convert base::service_unit to the new structure [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/371940 (owner: Giuseppe Lavagetto)
[14:08:57] (PS3) Giuseppe Lavagetto: base::service_unit: move template rendering to the caller [puppet] - https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704)
[14:31:26] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25)
[14:33:26] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
[14:33:39] Operations: Reconfigure techconduct@wikimedia.org address - https://phabricator.wikimedia.org/T173314#3523977 (Dereckson)
[14:34:52] Operations, Phabricator, Patch-For-Review: Only allow Phabricator weekly project changes cron job on production, not labs - https://phabricator.wikimedia.org/T173297#3523992 (Dzahn) a:Dzahn
[14:47:29] (PS7) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[14:47:32] (PS1) Jcrespo: [WIP]mariadb: First attempt at a mydumper-based dump script [puppet] - https://gerrit.wikimedia.org/r/371944 (https://phabricator.wikimedia.org/T169516)
[14:49:06] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3524028 (Cmjohnson) a:Cmjohnson>RobH @RobH These came with 10G nics, the nics have been disabled.
[14:57:43] Amir1: depends, we have a problem naming things so what we call logstash is really logstash (routes log messages), elasticsearch (stores log messages), and kibana (pretty interface to elasticsearch). You can talk to elasticsearch directly if the right firewall holes are open
[14:58:34] Operations, ops-eqiad, Analytics-Kanban, User-Elukey: Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s - https://phabricator.wikimedia.org/T172633#3524055 (Cmjohnson) The cable seemed loose and would disconnect at the server eth0 port when touched. Replaced the cable and moved i...
[15:09:11] (CR) MarcoAurelio: "> @MarcoAurelio I run the expanddblist binary and uploaded as new" [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[15:14:57] (PS13) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[15:15:38] (CR) Urbanecm: "This directory is supposed to contain symbolic links to files which should be publicly displayed at noc.wikimedia.org. Jenkins asserts all" [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[15:16:42] (CR) Giuseppe Lavagetto: [C: 1] role::analytics_cluster::hadoop::client: move to profiles (p2) (5 comments) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[15:33:57] (PS1) Hoo man: Use gzip -9 for compressing the Wikidata entity dumps [puppet] - https://gerrit.wikimedia.org/r/371946
[15:36:02] Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3524181 (Ottomata) @Cmjohnson Heyaaa, we are pretty ready and excited to start working with these. Can you let us know when they'l...
[15:37:43] Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3524196 (Cmjohnson) @ottomata okay, I understand I will get them going as soon as I can there in my being worked on queue with a fe...
[15:44:19] Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3524215 (Ottomata) Great, thank you!
[15:47:03] Operations, Analytics-Kanban, hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3524220 (Nuria)
[15:55:45] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1761 MB (3% inode=97%)
[15:55:52] (PS1) Ayounsi: Define management networks and allow them to send syslog to logstash [puppet] - https://gerrit.wikimedia.org/r/371949
[16:00:46] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1741 MB (3% inode=97%)
[16:01:35] yes we got it icinga :)
[16:01:36] checking
[16:01:53] seems the /var/log dir
[16:02:27] precisely /var/log/carbon
[16:05:16] (in a meeting but will clear logs in a bit if nobody gets to it)
[16:05:45] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1732 MB (3% inode=97%)
[16:06:24] (CR) Alexandros Kosiaris: [C: 1] Add 90s command_timeout override to nrpe_local.cfg [puppet] - https://gerrit.wikimedia.org/r/370858 (https://phabricator.wikimedia.org/T172921) (owner: Herron)
[16:10:46] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1698 MB (3% inode=97%)
[16:13:54] Operations, Developer-Relations: Reconfigure techconduct@wikimedia.org address - https://phabricator.wikimedia.org/T173314#3524287 (Aklapper)
[16:15:15] RECOVERY - Host labvirt1015 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[16:15:55] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1714 MB (3% inode=97%)
[16:15:58] (PS2) Gehel: wdqs - send logs to logstash [puppet] - https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710)
[16:16:37] Operations, ops-eqiad, DC-Ops, cloud-services-team: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3524330 (Cmjohnson) @Andrew Today, I drained flea power (first step in all Dell troubleshooting) and cleared the system log. Let's let it go and see if you have anymore problems.
[16:20:51] (CR) Gehel: [C: -1] "Needs to be deployed with the corresponding wdqs patch" [puppet] - https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: Gehel)
[16:20:55] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1697 MB (3% inode=97%)
[16:21:55] herron: o/ - don't want to step into your work, are you doing anything on graphite2001?
[16:22:49] not sure why all the listerner.log(s) have tons of " invalid line"
[16:22:56] elukey no worries, was just looking at the free disk space
[16:23:38] ahh okok.. I am seeing a lot of invalid line (librenms.etc..
[16:23:44] Cc XioNoX
[16:25:04] elukey: what do you mean?
[16:25:24] Hello :)
[16:25:56] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1676 MB (3% inode=97%)
[16:26:12] XioNoX: so in /var/log/carbon/etc../listener.log I can see a lot of "invalid line" log entries for various librenms devices
[16:26:13] elukey: librenms + graphite makes me think about https://phabricator.wikimedia.org/T171167
[16:27:03] I think we can safely drop oldish log files
[16:29:55] !log execute sudo find -type f -mtime +60 -exec rm {} \; in /var/lib/carbon on graphite2001 to free some space in /
[16:29:56] RECOVERY - Disk space on graphite2001 is OK: DISK OK
[16:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:11] ergh it was var/log/carbon
[16:30:13] amending
[16:34:38] Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3456248 (elukey) Just removed some old log files from /var/log/carbon/* on graphite2001, they were full of things like `invalid line (librenms.x.y...
[16:34:42] commented also in the task --^
[16:38:14] Operations, Analytics, Traffic, Varnish: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374#3524406 (elukey) Question for the Traffic team: is this task still valid after T138747 or shall we call it done?
[16:41:53] Operations, ops-eqiad, Analytics-Kanban, User-Elukey: Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s - https://phabricator.wikimedia.org/T172633#3524408 (elukey) Open>Resolved
[16:44:02] (CR) Gehel: base::service_unit: move template rendering to the caller (1 comment) [puppet] - https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[16:46:00] Operations, Developer-Relations: Reconfigure techconduct@wikimedia.org address - https://phabricator.wikimedia.org/T173314#3524416 (Dereckson) p:Triage>High (tentatively triaging as High as it's trivial enough to be directly actionable by someone with the access, and because we want this address...
[16:46:29] Operations, Developer-Relations: Reconfigure techconduct@wikimedia.org address - https://phabricator.wikimedia.org/T173314#3524433 (Dereckson)
[16:47:33] Operations, Developer-Relations: Reconfigure techconduct@wikimedia.org address - https://phabricator.wikimedia.org/T173314#3523977 (Dereckson)
[16:59:05] Operations, JobRunner-Service, MediaWiki-Platform-Team, monitoring, and 2 others: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479#3499719 (Anomie) >>! In T172479#3502395, @greg wrote: > Adding #mediawiki-platform-team as @aaron is mainta...
[17:00:04] gehel: Dear anthropoid, the time has come. Please deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170814T1700).
[17:00:28] jouncebot: nothing scheduled for WDQS today...
[17:05:08] Operations, ops-eqiad, Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3524455 (Cmjohnson) Replaced the ssd, needs re-install
[17:09:34] !log resetting mode on stat1005:/srv/published-datasets/discovery recursively
[17:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:22] (Abandoned) Bearloga: statistics::discovery: Manage datasets dir [puppet] - https://gerrit.wikimedia.org/r/371769 (https://phabricator.wikimedia.org/T170494) (owner: Bearloga)
[17:15:20] (CR) Giuseppe Lavagetto: base::service_unit: move template rendering to the caller (1 comment) [puppet] - https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[17:19:22] (CR) Gehel: [C: 1] "That explanation sounds good to me..." [puppet] - https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[17:32:52] (PS1) Herron: Add 5 second "greet pause" delay to lists.wikimedia.org SMTP [puppet] - https://gerrit.wikimedia.org/r/371958 (https://phabricator.wikimedia.org/T173143)
[17:52:40] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3524578 (Multichill) @fgiunchedi Shame we missed at Wikimania! We'll keep in touch about this.
[18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170814T1800). Please do the needful.
[18:06:30] (Abandoned) EBernhardson: Switch elastic1017 to LVM [puppet] - https://gerrit.wikimedia.org/r/371210 (https://phabricator.wikimedia.org/T169498) (owner: EBernhardson)
[18:08:31] (PS1) EBernhardson: Revert "Switch elastic1017-1031 to niofs" [puppet] - https://gerrit.wikimedia.org/r/371962 (https://phabricator.wikimedia.org/T169498)
[18:08:58] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3517417 (Nick) The biggest issue is surely going to be people trying to upload their own images from Flickr without changing the licence to something Commons comp...
[18:13:13] Operations, Traffic, Patch-For-Review: Non zero rated LVS IPs - https://phabricator.wikimedia.org/T170518#3524626 (BBlack) Open>stalled Re-evaluating alternatives here, hold on actual implementation for now.
[18:17:47] Operations, Analytics, Traffic, Varnish: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374#3524632 (BBlack) I think there's still some work here to do, if nothing else to audit the situation as it stands. There's basically two things to...
[18:20:30] Operations, Ops-Access-Requests, Release-Engineering-Team (Watching / External), User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3524636 (greg)
[18:21:33] Operations, JobRunner-Service, MediaWiki-Platform-Team, monitoring, and 2 others: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479#3524646 (aaron) Yeah that list should be updated. I happen to investigate things there often (for now), tho...
[18:30:31] Operations, ops-eqiad, DC-Ops, Discovery-Search (Current work): some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816#3524651 (debt) Open>Resolved Woohoo!
[18:30:53] Operations, JobRunner-Service, Performance-Team, monitoring, and 2 others: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479#3524653 (greg) >>! In T172479#3524447, @Anomie wrote: >>>! In T172479#3502395, @greg wrote: >> Adding #mediawiki-p...
[18:33:30] (PS1) EBernhardson: Set elasticsearch servers to use 128kB readahead [puppet] - https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498)
[18:35:27] (CR) EBernhardson: "I'm not 100% sure this is the right place, swift creates udev profiles from the profile and the module. Additionally this only changes md2" [puppet] - https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: EBernhardson)
[18:46:57] ebernhardson: yeah, Thanks. I tried some stuff to query elasticsearch but didn't work
[18:47:24] Amir1: how so?
[18:47:43] like trying logstahs.wikimedia.org/_search
[18:47:50] but gave me 404
[18:48:33] Amir1: ahh, are you trying to search from outside the prod cluster?
[18:48:43] or from an instance inside the cluster?
[18:48:49] outside
[18:48:55] Isn't it possible?
[18:50:13] Amir1: technically, but you'll had to provide ldap credentials. also you have to use the msearch interface, via logstash.wikimedia.org/elasticsearch/_msearch
[18:50:30] That's great
[18:50:33] let me try
[18:50:35] Amir1: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html
[19:27:20] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3524772 (LilyOfTheWest) >>! In T173056#3524621, @Nick wrote: > The biggest issue is surely going to be people trying to upload their own images from Flickr withou...
[19:42:14] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3524848 (Papaul) Hi Papaul, I’m back in the office today and getting caught up on my emails and case backlog. I’m in the process of reviewing the case status and will provide feedback / update shortly. Reg...
[19:51:08] (PS1) Dereckson: Fix expanddblist shebang [mediawiki-config] - https://gerrit.wikimedia.org/r/371966
[19:51:10] (PS1) Dereckson: Fix notice when no argument is given [mediawiki-config] - https://gerrit.wikimedia.org/r/371967 (https://phabricator.wikimedia.org/T173342)
[19:51:30] (PS1) Dereckson: Allow to regenerate computed dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/371968 (https://phabricator.wikimedia.org/T173342)
[19:52:55] (CR) jerkins-bot: [V: -1] Allow to regenerate computed dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/371968 (https://phabricator.wikimedia.org/T173342) (owner: Dereckson)
[19:55:03] Fun ^
[19:55:21] the test fails but was imagined to only take .dblist files for one folder, not the other
[19:55:24] if ( substr( $fname, -strlen( '.dblist' ) ) === '.dblist' ) {
[19:55:47] (done on docroot/noc/conf/ side, not on dblists/ side)
[19:57:02] (CR) Dereckson: [C: -1] "I'll fix NocDblistTest to filter dblists/ to *.dblist too (filtering is only done at docroot/noc/conf/ level)." [mediawiki-config] - https://gerrit.wikimedia.org/r/371968 (https://phabricator.wikimedia.org/T173342) (owner: Dereckson)
[20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170814T2000). Please do the needful.
[20:00:14] no parsoid deploy today
[20:09:34] (CR) Gehel: Set elasticsearch servers to use 128kB readahead (1 comment) [puppet] - https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: EBernhardson)
[20:35:06] !log ppchelko@tin Started deploy [restbase/deploy@4d6c706]: Temporary fallback to the new storage buckets before truncation
[20:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:41] !log ppchelko@tin Finished deploy [restbase/deploy@4d6c706]: Temporary fallback to the new storage buckets before truncation (duration: 09m 35s)
[20:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170814T2100).
[21:00:24] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1953 bytes in 0.132 second response time
[21:05:34] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 413.03 seconds
[21:05:54] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 432.75 seconds
[21:07:02] its happening again....
[21:07:24] Operations, DBA, Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3525085 (Addshore) It looks like something just happened on db1082 again? at roughly 16:54 edits on wikidata dropped off again and there was a spike in the slave lag. > IRC echo bot...
[21:07:34] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[21:07:54] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 15.35 seconds
[21:08:35] and its fixed again
[21:10:24] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1950 bytes in 0.183 second response time
[21:18:39] addgoat: isn't that only codfw, so not really noticeable at production?
[21:19:11] hmmm both db2038 & db2052 are codfw? *chekcs*
[21:19:22] edit rate on wikidata.org dropped to 0 for a couple of mins
[21:19:58] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3525092 (Multichill) https://www.flickr.com/services/api/flickr.photos.search.html is probably the api function needed to generate the list of pictures
[21:19:59] oh wait db2___ of course that is codfw
[21:20:01] addgoat: AFAIK all starting with 1 are eqiad and all with 2 are codfw
[21:20:05] yeah
[21:20:20] addgoat: so what you need is here I guess: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1&var-dc=eqiad%20prometheus%2Fops
[21:21:09] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3525094 (Multichill)
[21:30:34] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3525130 (Multichill) More about this at https://www.flickr.com/groups/flickr10photowalks/discuss/72157687259285146/
[21:45:54] (PS2) Dereckson: Allow to regenerate computed dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/371968 (https://phabricator.wikimedia.org/T173342)
[21:48:02] (CR) jerkins-bot: [V: -1] Allow to regenerate computed dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/371968 (https://phabricator.wikimedia.org/T173342) (owner: Dereckson)
[21:49:55] (PS3) Dereckson: Allow to regenerate computed dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/371968 (https://phabricator.wikimedia.org/T173342)
[22:01:43] If there is no objection, I'll do the evening SWAT a little sooner to merge Pchelolo changes (backfixes for EventBus at the version still used by wikidata).
[22:02:48] no objections from me obviously
[22:03:26] Pchelolo: your changes are live on mwdebug1002
[22:04:20] no way to actually test is as the job production is not enabled on wikidata yet because it was blocked by these 2
[22:04:31] so it's pretty much a no-op right now
[22:05:00] if I pull on Terbium the change
[22:05:07] you can't run the job?
[22:05:47] (I don't know if some manual script can ask to process the queue)
[22:06:27] no Dereckson it's explicitly switched off right now for wikidata
[22:06:32] ok
[22:06:43] and later you'll monitor it when reenabling it?
[22:07:45] Operations, DBA, Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3525153 (Marostegui) I'm on a plane but the lag is probably due to the massive spike in UPDATES the master (db1063) had
[22:08:49] Dereckson: yup, will do reenabling on next morning SWAT
[22:09:08] (want more time for monitoring after reenabling)
[22:09:56] normally, in such case it's better to deploy both at the same time (excepted if you expect some benefit to have it deployed now)
[22:10:44] hm we can wait for next morning SWAT then
[22:12:02] seems better, as we frown upon non tested code in prod (yes it seems no-op to add a parameter but...)
[22:13:02] it's not non-tested, it's included in wmf.12 so it's been sitting in prod for weeks now
[22:13:12] just not on wikidata
[22:13:51] ok let's deploy it in this case, that will avoid revert changes to resubmit tommorow morning, and as the code isn't active on Wikidata, we can revert later without any impact
[22:15:11] syncing
[22:15:55] !log dereckson@tin Synchronized php-1.30.0-wmf.11/extensions/EventBus/JobQueueEventBus.php: JobQueueEventBus: Populate the database field + not set properties are accessed (duration: 00m 48s)
[22:16:06] Great, thank you!
[22:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:26] You're welcome.
[22:16:35] !log Previous log entry is related to [[Gerrit:371960]] and [[Gerrit:371961]].
[22:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:06] Pchelolo: I prepared a change for event-schemas with reference to the correct EventBus commit
[22:22:26] so you'll be ready for tommorow
[22:22:47] great, thank you. I'll add appropriate reviewers
[22:23:00] Not sure about tomorrow, it's the WMF holiday
[22:23:51] ah yes, Assumption
[22:26:04] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2028505
[22:28:12] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3525196 (Papaul) Hi Papaul, I had a chance to review the logs you provided and found the following: - Previous entries in SEL indicate memory errors - Replacement 32GB DIMM was dispatched on previous case...
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170814T2300).
[23:00:04] Pchelolo: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:02:23] Pchelolo, do you need any further deployments?
[23:02:57] no MaxSem all is handled, thank you