[08:57:34] (PS1) Giuseppe Lavagetto: puppet: disable hiera autolookup [puppet] - https://gerrit.wikimedia.org/r/380304
[10:50:39] hey, zuul is completely down: https://integration.wikimedia.org/zuul/
[11:32:02] (CR) MarcoAurelio: Add 'eliminator' as a priviliged account (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/379985 (https://phabricator.wikimedia.org/T176554) (owner: Ladsgroup)
[11:37:38] (PS3) Ladsgroup: Add 'eliminator' as a priviliged account [mediawiki-config] - https://gerrit.wikimedia.org/r/379985 (https://phabricator.wikimedia.org/T176554)
[11:43:26] (CR) MarcoAurelio: "Technically LGTM although, and I accept it might be looping on the loop, I'd have sorted the list of permissions alphabetically :) Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/379980 (https://phabricator.wikimedia.org/T176553) (owner: Ladsgroup)
[11:46:38] (CR) MarcoAurelio: [C: 1] "Looks good to me :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/379985 (https://phabricator.wikimedia.org/T176554) (owner: Ladsgroup)
[11:58:48] !log restarted ircecho on einsteinium
[11:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:27] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag could not connect
[12:08:13] (PS11) MarcoAurelio: Cloud VPS configuration for hi.wikivoyage [puppet] - https://gerrit.wikimedia.org/r/371096 (https://phabricator.wikimedia.org/T173013)
[12:13:33] Operations, DBA: db2047 got rebooted - https://phabricator.wikimedia.org/T176573#3630077 (Volans)
[12:13:48] Operations, DBA: db2047 got rebooted - https://phabricator.wikimedia.org/T176573#3630090 (Volans) p:Triage>High
[12:14:33] ACKNOWLEDGEMENT - MariaDB Slave IO: s7 on db2047 is CRITICAL: CRITICAL slave_io_state could not connect Volans db2047 got rebooted, see https://phabricator.wikimedia.org/T176573
[12:14:34] ACKNOWLEDGEMENT - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag could not connect Volans db2047 got rebooted, see https://phabricator.wikimedia.org/T176573
[12:14:34] ACKNOWLEDGEMENT - MariaDB Slave SQL: s7 on db2047 is CRITICAL: CRITICAL slave_sql_state could not connect Volans db2047 got rebooted, see https://phabricator.wikimedia.org/T176573
[12:14:38] ACKNOWLEDGEMENT - mysqld processes on db2047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Volans db2047 got rebooted, see https://phabricator.wikimedia.org/T176573
[13:53:19] Operations, DBA: db2047 got rebooted - https://phabricator.wikimedia.org/T176573#3630137 (jcrespo) Logs: ``` 769 Informational iLO 4 09/24/2017 11:49 09/24/2017 11:49 1 On-board clock set; was 09/24/2017 11:32:42. 768 Caution iLO 4 09/24/2017 11:30 09/24/2017 11:30 1 Server reset. 767 Informational...
[13:55:42] Operations, DBA: db2047 got rebooted - https://phabricator.wikimedia.org/T176573#3630140 (jcrespo) Actually: ``` Critical Environment 09/24/2017 11:30 09/24/2017 11:30 1 Critical Temperature Threshold Exceeded (Temperature Sensor 17, Location System, Temperature 127C) ``` 127C, is the datacenter in f...
[14:00:39] Operations, DBA: db2047 got rebooted - https://phabricator.wikimedia.org/T176573#3630142 (Volans) @jcrespo interesting, I guess the documentation in https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0#Show_system_event_log_entries needs to be updataed to include the command to sh...
[14:04:23] Operations, DBA: db2047 got rebooted - https://phabricator.wikimedia.org/T176573#3630144 (jcrespo) Past hardware issue: T132011
[14:05:27] (PS2) Jcrespo: Pool db1055 with full weight, remove main traffic from rc replicas [mediawiki-config] - https://gerrit.wikimedia.org/r/379757
[14:05:29] (PS1) Jcrespo: mariadb: Depool db2047, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/380311 (https://phabricator.wikimedia.org/T176573)
[14:07:38] (PS2) Jcrespo: mariadb: Depool db2047, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/380311 (https://phabricator.wikimedia.org/T176573)
[14:10:57] (CR) Jcrespo: [C: 2] mariadb: Depool db2047, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/380311 (https://phabricator.wikimedia.org/T176573) (owner: Jcrespo)
[14:14:26] (Merged) jenkins-bot: mariadb: Depool db2047, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/380311 (https://phabricator.wikimedia.org/T176573) (owner: Jcrespo)
[14:16:08] (CR) jenkins-bot: mariadb: Depool db2047, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/380311 (https://phabricator.wikimedia.org/T176573) (owner: Jcrespo)
[14:16:24] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2047 (duration: 00m 48s)
[14:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:37] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[14:23:54] !log restarting db2047 for kernel upgrade
[14:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:19] (PS1) Jcrespo: Revert "mariadb: Depool db2047, hardware issues" [mediawiki-config] - https://gerrit.wikimedia.org/r/380314
[15:16:07] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[15:17:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[15:18:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[15:27:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:27:58] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:53:27] RECOVERY - exim queue on mx1001 is OK: OK: Less than 1000 mails in exim queue.
[17:00:54] (PS1) BryanDavis: tools: Disable IPv6 for static reverse provy [puppet] - https://gerrit.wikimedia.org/r/380318
[17:03:41] (CR) BryanDavis: "Intended to fix errors like:" [puppet] - https://gerrit.wikimedia.org/r/380318 (owner: BryanDavis)
[17:09:48] (PS2) BryanDavis: tools: Disable IPv6 lookup for static reverse proxy backends [puppet] - https://gerrit.wikimedia.org/r/380318
[17:41:07] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:09:08] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[18:21:27] PROBLEM - puppet last run on db1104 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:46:34] (PS3) Merlijn van Deen: tools: Disable IPv6 lookup for static reverse proxy backends [puppet] - https://gerrit.wikimedia.org/r/380318 (owner: BryanDavis)
[18:49:37] RECOVERY - puppet last run on db1104 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[18:52:04] (CR) Merlijn van Deen: "I'm not 100% sure whether the change will work -- https://trac.nginx.org/nginx/ticket/723 suggests this needs to be configured in the sys" [puppet] - https://gerrit.wikimedia.org/r/380318 (owner: BryanDavis)
[18:56:18] (CR) BryanDavis: "The more radical change would be to disable IPv6 /etc/sysctl.conf for vms in Cloud VPS since we know that none of them have IPv6 connectiv" [puppet] - https://gerrit.wikimedia.org/r/380318 (owner: BryanDavis)
[19:10:52] (CR) Merlijn van Deen: "Also a reasonable option. Another option is changing gai.conf ( https://askubuntu.com/questions/32298/prefer-a-ipv4-dns-lookups-before-aaa" [puppet] - https://gerrit.wikimedia.org/r/380318 (owner: BryanDavis)